Abstract

Classification of time series has wide Air Force, DoD and commercial interest, from
automatic target recognition systems on munitions to recognition of speakers in diverse
environments. The ability to effectively model the temporal information contained in a
sequence is of paramount importance. Toward this goal, this research develops theoretical
extensions to a class of stochastic models and demonstrates their effectiveness on the
problem of text-independent (language constrained) speaker recognition. Specifically within
the hidden Markov model architecture, additional constraints are implemented which better
incorporate observation correlations and context, where standard approaches fail. Two
methods of modeling correlations are developed, and their mathematical properties of
convergence and reestimation are analyzed. They differ in whether they model the correlation
present in the time samples or that present in the processed features, such as Mel frequency
cepstral coefficients. The system models speaker dependent phonemes, making use of word
dictionary grammars, and recognition is based on normalized log-likelihood Viterbi decoding.
Both closed set identification and speaker verification using cohorts are performed on
the YOHO database. YOHO is the only large scale, multiple-session, high-quality speech
database for speaker authentication and contains over one hundred speakers stating
combination locks. Equal error rates of 0.21% for males and 0.31% for females are demonstrated.
A critical error analysis using a hypothesis test formulation provides the maximum number
of errors observable while still meeting the goal error rates of 1% False Reject and 0.1%
False Accept. Our system achieves this goal. This research supports the many new electronic
applications requiring speech-based biometric authentication such as secure access
control, telephone-based recognition, transaction or credit account verification, forensic
science, law enforcement and military intelligence.

GENERALIZED HIDDEN FILTER MARKOV MODELS
APPLIED  TO  SPEAKER  RECOGNITION 

I.  Introduction 

1.1  Historical  Overview 

Classification  of  time  series  has  wide  Air  Force,  DoD  and  commercial  interest,  from 
automatic  target  recognition  systems  on  munitions  to  recognition  of  speakers  in  diverse 
environments.  The  ability  to  effectively  model  the  temporal  information  contained  in  a 
sequence  is  of  paramount  importance.  Toward  this  goal,  this  research  develops  theoretical 
extensions to a class of stochastic models and demonstrates their effectiveness on the problem
of text-independent (language constrained) speaker recognition. Specifically within the
hidden  Markov  model  architecture,  additional  constraints  are  implemented  which  better 
incorporate  observation  correlations  and  context,  where  standard  approaches  fail. 

The speech signal contains a great deal of information beyond the sequence
of words. It carries the acoustic environment (car, aircraft, machinery, office noise),
gender, prosody (pitch changes, syllable stress, loud or soft speech, emotional state of the
speaker), language, dialect or ethnic characteristics and speaker information. This latter
information in the speech signal is what a speaker recognition system seeks and exploits.
Speaker recognition applications include closed-set identification, open-set identification
and verification. With the electronic age, there come many new applications for biometric
authentication, in addition to forensic science [49, 50], security access and specific military
requirements [129].

A speaker has two biological areas of uniqueness [85]. These include the vocal
physiology and the learned neural control of the articulators which control the physiology. The
first area includes such physical factors as length of vocal tract; size of mouth and nasal
cavities; glottal size, shape and pulse patterns; and teeth and lip characteristics. The
second area includes the learned habits of these facilities such as dialect or regional
accents, pronunciation or ethnic traits, and speed and timing of the articulators. The latter
neural code may never be modeled directly, but the overall effect shows up eventually in
the dynamics of the acoustic signal, such as formant transitions and coarticulation effects. In
fact, anatomical models attempting to estimate control of vocal articulators have been
proposed for speech recognition [43, 44, 118]. For the receiver, biological acoustic
phenomena also support the value of classification of speech and speakers using a temporal
model. Auditory psychoacoustic studies provide a wealth of examples that relate specific
temporal changes in the acoustic signal to a specific auditory event [82, 83] or measured
electro-chemical response [117, 132]. Together, both the effects of physiology and the
learned neural traits dynamically alter the acoustic spectrum through formant transitions
and coarticulation effects; the ability to accurately model these spectra should be useful
for speaker recognition.

Historically, speaker recognition has made use of techniques borrowed from speech
recognition research. Distortion based methods were first chosen to compare speaker
spectral representations. These methods used long term spectral averages as a representation
[116]. Later, some form of dynamic time warping was used for text-dependent applications
[130], allowing recognition of previously recorded utterances. Depending on the extracted
features, certain distortions or metrics were proposed which were optimal for those features
[55, 95]. Similarity of test speech to speaker models was based on overall distance or
distortion. In the mid-1980’s, Soong [119] proposed a clustering approach for text-independent
applications. This classic approach will be referred to as vector quantization (VQ) since a
clustering of a speaker’s training features/vectors becomes the model and classification is
determined by minimum quantization error. Many successful applications and variations
of this procedure have been accomplished [22, 23, 38, 61, 79, 120, 135]. Vector quantization
assumes each observation is independent in time, which is clearly not true for speech signals.

Over  the  last  decade,  the  predominant  speech  recognizers  have  been  based  on  the 
hidden  Markov  model  (HMM),  first  pioneered  by  Baum  and  his  colleagues  [7,  8,  9,  10]  and 
soon  thereafter  applied  to  automatic  speech  recognition  (ASR)  [96].  This  statistical  model 
is  complex  enough  to  model  the  variability  of  the  speech  waveform,  yet  simple  enough  for 
its parameters to be estimated [16]. The HMM framework provides efficient Maximum
Likelihood (ML) reestimation/training algorithms with desirable properties and methods
to model and decode/recognize the many levels of speech: acoustic, phoneme, word and
language. Speaker recognition, for instance when needing personal identification numbers
(PIN) or passwords, may need to perform both speech and speaker recognition. The ability
to  remove  the  effects  of  the  word  sequence  and  extract  speaker  dependencies  alone  is  an 
unsolved  problem.  With  the  increasing  performance  of  hidden  Markov  models  on  speech 
recognition,  several  researchers  started  examining  these  statistical  techniques  for  automatic 
speaker  recognition. 

Poritz [92] was one of the first to apply hidden Markov models to speaker
identification, as well as a hidden filter method, though his results were preliminary. In the early
1990’s, Tishby [124] extended these hidden filters, complete with multiple mixtures [57, 60].
His results indicated that the transitions (temporal structure) of the hidden Markov chain
were unnecessary. Furui [113] later compared vector quantization (VQ) codebooks to the
ergodic HMM structure and also concluded that output density mixture numbers alone
were responsible for performance. In effect, these researchers concluded that modeling
only the spectrum of a speaker, and not the temporal patterns of the spectrum, was
necessary for recognition. This appears to contradict the second well-known characteristic
of voice differences, namely the speaking habits and learned patterns of speech. Levinson
[72] has pointed historically to key experiments, including those of Markov himself, which
demonstrated that certain HMM architectures will learn the structure of the language itself. Thus,
specific architectures of an HMM may not be well-suited to model speaker dependencies.

Another  related  approach  making  an  observation  independence  assumption  is  the 
Gaussian  Mixture  Model  (GMM),  pioneered  by  Reynolds  [101,  102,  103,  104].  In  this 
model,  a  speaker’s  spectral  vectors  are  represented  by  a  mixture  of  multivariate  normal 
densities,  reestimated  using  the  Expectation  Maximization  algorithm  of  Dempster  [28]. 
The  GMM  assumes  no  temporal  structure  within  the  signal  and  can  be  considered  a  special 
case  of  the  more  general  HMM,  with  a  single  state.  Each  of  these  researchers  applied  the 
hidden  Markov  models  to  unlabeled  speech,  where  a  single  model  represented  all  possible 
speech interactions and transitions. These methods contrast sharply with speech recognition,
where  tens  of  phoneme  models  or  thousands  of  context  dependent  tri-phone  models  are 
required.

Hidden Markov models make erroneous, simplifying assumptions about the dynamics
of the speech observations. Existing models assume speech features are generated by a
discrete state Markov process. Furthermore, the observations are the result of a
probabilistic function of this hidden process and are considered conditionally independent. It seems
intuitive that past and/or future observations provide extra information concerning the
context of the current realization. In order to improve upon current statistical techniques,
this independence assumption must be removed. Recently, several researchers have been
relaxing these assumptions. Methods such as multi-layer perceptrons (MLP) and other
neural network/HMM hybrids [13, 26] have emerged for speech recognition, though they
have required specialized hardware for training. Others have proposed linear predictive
densities [66, 131] or joint normal densities [16] for speech recognition, though they have
shown little improvement. Still others have tried polynomial representations [29] and
Kalman filtering approaches [33]. An original contribution of this work includes modeling
speaker dependent phonemes by the use of Markov modulated rational filters.

Speaker  recognition  continues  to  be  a  potential  application  area  for  better  time  series 
modeling,  attracting  entire  workshops  [113]  and  recent  dissertations  [18,  102].  This  time 
series  provides  a  challenge  since  channel  and  recording  instrumentation,  effects  of  particular 
text,  prosody  and  speaker  variability  add  to  the  classification  difficulty.  Recently  at  an 
international conference focusing on speaker recognition research, Furui supported this
dissertation’s approach, stating [113],

As  fundamental  research,  it  is  important  to  pursue  a  method  for  extracting 
and  representing  the  speaker  characteristics  that  are  commonly  included  in  all 
the phonemes irrespective of the speech text. ... It is expected that diversified
research  related  to  speaker-specific  information  in  speech  waves  will  become 
more  active  in  the  near  future. 

Lastly,  the  contributions  of  accurately  modeling  speakers  may  provide  for  better 
speech recognition. Speaker adaptation refers to the methods used to transform a speaker
independent (SI) speech recognizer for a particular speaker. Large, accurate speech models
require large amounts of training data, and it is often impractical or impossible to acquire
enough training data for each speaker. Instead, speech from many speakers is used to
train a speaker independent recognizer, then these models are adapted to become speaker
dependent (SD). Research in speaker modeling provides valuable insight into solutions for
this adaptation.

1.2  Problem  Statement  and  Scope

A  complete  framework  which  encompasses  many  older  and  newly  developed  models 
of  discrete  state  dynamic  systems  will  be  created.  New  analysis  and  reestimation  of  several 
classes  of  linear  functions  within  a  hidden  Markov  model  will  be  accomplished.  Specifically, 
probabilistic  linear  functions  of  a  hidden  Markov  process  will  account  for  context  and 
correlation  in  the  observations.  These  new  models  will  then  be  applied  to  the  difficult 
problem  of  modeling  speaker  dependencies  within  language-constrained  (digits)  speech. 

1.2.1  Scope.  Existing automatic speaker recognition methods do not model the
spectral phoneme-level dynamics, since the current models assume observations are
statistically independent. Past methods have attempted to model speakers using
1) models assuming independent observations, 2) models assuming state-conditional
independent observations or 3) architectures which grossly estimated language and grammar
dynamics. This has left a large window of opportunity for extensions of the current
statistical models.
Whether  the  goal  is  to  classify  a  sequence  of  observations,  predict  a  time  series,  or  uncover 
the  hidden  “state”  of  a  system,  this  research  has  great  relevance.  This  research  addresses 
the  reestimation  of  generalized  statistical  models  for  eventual  classification  of  time  series, 
and  in  particular  applying  these  to  speaker  dependent  phoneme  modeling. 

1.2.2  Research  Contributions.  Toward solving these problems, a number of
original research contributions have been completed. These include:

Generalized Hidden Filter Architecture. A complete framework
including many existing linear and nonlinear systems used for classification, as well as
prediction, is developed for discrete state Markov models. The existing hidden Markov
model independence assumptions are reviewed and removed, thus defining a new, more
generalized, hidden filter Markov model.

New reestimation methods are provided for autoregressive (AR) and autoregressive
moving average (ARMA) filters, as well as an optimal initialization strategy. These models are
allowed nonzero biases and either state-conditioned or common noise statistics. The ability
to reestimate these filters adequately for the difficult ergodic case is novel and shown by
example. This new class of ARMA Markov modulated hidden filters is applicable to specific
broad classes of phonemes, with a spectral zero component. Lastly, filters operating on
frames of speech have been extended from simple architectures to multi-state phoneme
models.

Vector  Autoregressive  Hidden  Filters.  The  extension  from  sample  or 
frame based filters to full vector autoregressive hidden filters is developed with an
emit-on-state notation. Several variations of the model include the regression characteristics of each
vector element on past elements and noise correlation. The choice of spectral features, the
Mel frequency cepstral coefficients, dictates a diagonal matrix filter, with a least-squares
solution developed within. A procedure of a posteriori mean removal is developed to
separate the state mean estimation from the filter coefficients for numerical stability.

HMM and Hidden Filter Analysis. A new proof of monotonic
convergence for Gaussian mixtures is presented using a new equivalence model paradigm. A
new proof of monotonic convergence for hidden filter Markov models is then demonstrated.
The Markov property of the observations for hidden filter models is then applied
to the Fielding [42] information theoretic proof. Since pattern recognition methods seek
ways to reduce entropy (and thereby classification errors), this new theorem justifies the
hidden filter model over standard hidden Markov models.

Phonetic  Modeling  for  Speaker  Recognition.  The  extensive  Linguistic 
Data  Consortium  (LDC)  YOHO  database  is  used  for  all  experimentation.  A  speaker 
dependent  phoneme-based  hidden  filter  Markov  model  approach  is  accomplished  for  both 
speaker  identification  and  verification.  The  most  current  speech  recognition  tools  are 
incorporated  such  as  phonetic  labeling,  word  dictionaries,  bi-word  language  models  and 
Viterbi scoring constraints. The method of forced Viterbi decoding of phoneme based
temporal models for speaker verification is the first to be published. Likelihood ratio
normalization  using  cohorts  is  accomplished  and  error  rates  shown  using  a  newly  developed 
second  order  cohort  selection  strategy.  A  unique  critical  error  analysis  is  provided  for 
YOHO  at  the  mixed  5%  and  25%  significance  levels  for  false  acceptance  and  false  rejection 
target  error  rates,  respectively. 

Many  current  techniques  apply  models  which  assume  independent  observations  or 
do  not  target  the  dynamics  present  in  speech  or  the  processed  speech  vectors.  Those 
techniques which do attempt to model the dynamical properties have not targeted
individual phonemes. Our state-of-the-art approach develops state-dependent dynamic systems
within phonemes for speaker recognition, providing equal error rates of 0.21% for males
and  0.31%  for  females.  These  error  rates  have  also  been  shown  to  statistically  satisfy  the 
hypothesis  that  our  system  meets  or  beats  the  U.S.  Government  target  error  rates  of  1% 
false  rejection  and  0.1%  false  acceptance. 

1.3  Dissertation  Organization 

This document is organized into six main chapters. The following chapter provides
background material concerning hidden Markov model theory and several recent
developments. It provides a new architecture unifying many other techniques. Chapter III
develops the reestimation equations for hidden filter Markov models, at the scalar
(sample and frame) and vector (feature) levels. In Chapter IV, the analysis of the monotonic
likelihood reestimation is demonstrated along with an information theoretical justification
for the hidden filter model. Chapter V provides an in-depth analysis of the phonetic hidden
filter Markov modeling approach to speaker recognition. The final chapter offers several
research-directed recommendations and conclusions with a brief review of contributions.

II.  Background


2.1  Introduction

This  chapter  introduces  the  hidden  Markov  model  and  several  extensions  for  use  in 
modeling  speech  and  speakers.  Given  the  intra-speaker  variability  of  speech  over  a  set  of 
words,  a  statistical  model  which  attempts  to  estimate  these  variabilities  presents  the  best 
solution.  The  HMM  makes  use  of  a  hidden  changing  state,  where  the  state  may  represent 
some  particular  spectrum  of  speech  or  some  dynamics  of  this  spectrum.  The  next  section 
describes  the  theory  underlying  standard  HMMs,  and  the  assumptions  often  made.  Next, 
the  assumptions  are  relaxed  to  model  the  dynamics  of  speech,  for  both  frames  of  speech 
and processed features. The following section illustrates the typical processing of speech for
extracting features and analyzes their independence. Lastly, a linear method to extract
transitional information of the feature process is provided.

2.2  Statistical  Hidden  Markov  Models 

Consider a source system which traverses between $N$ hidden states or characteristic
modes, denoting this sequence as $q_1, \ldots, q_T$, where $q_t \in \{1, 2, \ldots, N\}$. This sequence is
a Markov chain and will be assumed to be a discrete first order Markov process. As such
its behavior can be described completely by a set of state transition probabilities $A$ and
initial state probabilities $\Pi$. Assuming stationarity of this process allows the transitions
to be independent of time.

$$A = (a_{ij}) = P(q_t = j \mid q_{t-1} = i) \qquad (1)$$

An ergodic model is generally assumed to allow the full set of transitions between all states.
Most often with speech, a restricted set is used. A left-to-right model is composed of an
upper triangular $A$ matrix, and occasionally further restricts skipping states. An example
of the standard left-to-right model is shown in Figure 1.
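
The left-to-right restriction can be illustrated with a short sketch. The function names, the self-transition probability of 0.6 and the absorbing final state are assumptions made only for this illustration; they are not values taken from this research. States are 0-indexed in the code.

```python
import random


def left_to_right_transitions(n_states, stay_prob=0.6, allow_skip=False):
    """Build an upper triangular transition matrix A (illustrative values only)."""
    A = [[0.0] * n_states for _ in range(n_states)]
    for i in range(n_states):
        if i == n_states - 1:
            A[i][i] = 1.0  # assumption: the final state is absorbing
            continue
        targets = [i + 1]
        if allow_skip and i + 2 < n_states:
            targets.append(i + 2)  # optional skip transition
        A[i][i] = stay_prob
        for j in targets:
            A[i][j] = (1.0 - stay_prob) / len(targets)
    return A


def sample_states(A, pi, T, rng):
    """Draw q_1..q_T from a first order Markov chain with initial probabilities pi."""
    n = len(pi)
    q = [rng.choices(range(n), weights=pi)[0]]
    for _ in range(T - 1):
        q.append(rng.choices(range(n), weights=A[q[-1]])[0])
    return q


A = left_to_right_transitions(3)
states = sample_states(A, [1.0, 0.0, 0.0], 10, random.Random(0))
print(A)
print(states)
```

Because every entry below the diagonal is zero, a sampled state sequence can never move back to an earlier state, which is the property that makes this structure suitable for modeling the progression through a phoneme.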


$$\Pi = (\pi_i) = P(q_1 = i) \qquad (2)$$

Figure 1.  Standard three state left-to-right multivariate Gaussian mixture hidden Markov
model. Shown with upper triangular transition matrix $A$. Each state is
described by a parametric output density $b_i(O_t)$.

For the left-to-right model, $\pi_1 = 1$ and $\pi_j = 0$, $j > 1$. These will not have to be reestimated.
The states of an ergodic model are also characterized by stationary distributions so that

$$\Pi^{\infty} = (\pi_i^{\infty}) = P(q_t = i) \qquad (3)$$
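
As a small numerical illustration of a stationary distribution, the sketch below repeatedly applies a transition matrix to a starting distribution until it settles. Power iteration is one convenient way to obtain the stationary distribution of an ergodic chain; it is an assumption of this example rather than a method taken from this research, and the two-state matrix is made up.

```python
def stationary_distribution(A, iters=1000):
    """Approximate the stationary distribution of an ergodic chain by power iteration."""
    n = len(A)
    pi = [1.0 / n] * n  # any valid starting distribution works for an ergodic chain
    for _ in range(iters):
        # One step of the chain: pi_j <- sum_i pi_i * a_ij
        pi = [sum(pi[i] * A[i][j] for i in range(n)) for j in range(n)]
    return pi


A = [[0.9, 0.1],
     [0.5, 0.5]]
print(stationary_distribution(A))
```

For this matrix the balance condition $0.1\,\pi_1 = 0.5\,\pi_2$ gives $\pi_1 = 5/6$ and $\pi_2 = 1/6$, which the iteration approaches.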


At each time, the system generates an observation $O_t$ based on some probabilistic
function  of  the  Markov  chain.  It  is  this  function  which  is  the  most  important  component 
of  the  HMM  [124].  The  output  distribution  function  for  each  state  can  be  either  discrete 
or  continuous.  In  the  discrete  case,  the  distribution  function  is  a  set  of  probabilities 
associated  with  each  output  symbol.  Often  these  symbols  relate  to  a  particular  codeword 
of  a  codebook.  Typically,  the  output  function  is  continuous  -  a  convex  combination  of 
multivariate  Gaussian  densities. 

$$b_i(O_t) = \sum_{k=1}^{M} c_{ik}\, \mathcal{N}(O_t; \mu_{ik}, \Sigma_{ik}), \qquad \mathcal{N}(O_t; \mu_{ik}, \Sigma_{ik}) = |2\pi\Sigma_{ik}|^{-1/2} \exp\left\{ -\tfrac{1}{2} (O_t - \mu_{ik})^{\top} \Sigma_{ik}^{-1} (O_t - \mu_{ik}) \right\} \qquad (4)$$

where this density has parameters $c_{ik}$, $\mu_{ik}$, and $\Sigma_{ik}$, denoting the mixture weights, mean
and covariance for the $i$-th state and $k$-th mixture, respectively. This now enables a formal
definition for a hidden Markov model.

Definition II.1 (Hidden Markov Model) A Hidden Markov Model is a probabilistic
function of a first order Markov state process, denoted by the triple $\lambda = (\Pi, A, B)$ where $\Pi$ is
the $N \times 1$ vector of initial state probabilities, $A$ is the $N \times N$ matrix of transitions and $B$
is the set of all parameters describing the unconditional output state density for all states.
These include

•  $\mu_{ik}$: mean for state $i$, mixture $k$

•  $\Sigma_{ik}$: covariance for state $i$, mixture $k$

•  $c_{ik}$: state $i$, mixture $k$ weight
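
The output density of a single state can be evaluated as in the following sketch. It assumes diagonal covariance matrices (so each covariance is passed as a list of variances) and works in the log domain with a log-sum-exp step; both choices are simplifications for this illustration, and the function names are hypothetical.

```python
import math


def log_gaussian_diag(o, mean, var):
    """Log density of a multivariate normal with diagonal covariance (assumption)."""
    return sum(-0.5 * (math.log(2.0 * math.pi * v) + (x - m) ** 2 / v)
               for x, m, v in zip(o, mean, var))


def log_mixture_density(o, weights, means, variances):
    """Log of the convex combination sum_k c_k N(o; mu_k, Sigma_k) for one state."""
    logs = [math.log(c) + log_gaussian_diag(o, m, v)
            for c, m, v in zip(weights, means, variances)]
    top = max(logs)  # log-sum-exp keeps tiny likelihoods from underflowing
    return top + math.log(sum(math.exp(l - top) for l in logs))


# Two-mixture state over two-dimensional observations (made-up parameters).
weights = [0.3, 0.7]
means = [[0.0, 0.0], [1.0, -1.0]]
variances = [[1.0, 1.0], [0.5, 2.0]]
print(log_mixture_density([0.5, -0.2], weights, means, variances))
```

The mixture weights must be nonnegative and sum to one for the combination to remain a valid density.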


The  maximum  likelihood  estimation  of  all  parameters  will  be  examined  in  Chapter 
III.  The  trained  Markov  models  can  be  compared  to  observation  sequences,  by  a  decoding 
process  which  attempts  to  uncover  the  hidden  state  sequence  and  provides  a  likelihood 
of the observation given the model parameters. Consider an observation sequence,
$O = \{O_1, O_2, \ldots, O_T\}$, with its corresponding hidden state sequence $Q = \{q_1, q_2, \ldots, q_T\}$ [100].
Making use of the model assumptions, the likelihood of the observation sequence for this
state sequence is

$$p(O \mid Q, \lambda) = \prod_{t=1}^{T} p(O_t \mid q_t, \lambda)$$

while the likelihood of the state sequence can be expanded using the Markov property.

$$p(Q \mid \lambda) = \pi_{q_1}\, a_{q_1 q_2} \cdots a_{q_{T-1} q_T}$$

The marginal likelihood follows by summing the joint likelihood over all state sequences.

$$p(O, Q \mid \lambda) = p(O \mid Q, \lambda)\, p(Q \mid \lambda)$$

$$p(O \mid \lambda) = \sum_{Q} p(O \mid Q, \lambda)\, p(Q \mid \lambda) \qquad (5)$$

Equation 5 provides the likelihood of a sequence and is used to score how well a sequence
$O$ matches a particular model $\lambda$. This exact calculation requires on the order of $N^T$
summations. The Viterbi decoding algorithm approximates this quantity by the joint
likelihood of the observations and the single most likely hidden state sequence. This can be
accomplished in only $N^2 T$ operations.

$$p(O \mid \lambda) \approx \max_{Q} p(O, Q \mid \lambda) \qquad (6)$$
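
The relationship between Equations 5 and 6 can be checked on a toy model. The sketch below assumes discrete output symbols (a table `B[i][symbol]`) instead of Gaussian mixtures and works with raw probabilities instead of logarithms, which is only reasonable for very short sequences; all parameter values are made up for illustration.

```python
import itertools


def joint_likelihood(obs, states, pi, A, B):
    """p(O, Q | model) for one explicit state sequence (0-indexed states)."""
    p = pi[states[0]] * B[states[0]][obs[0]]
    for t in range(1, len(obs)):
        p *= A[states[t - 1]][states[t]] * B[states[t]][obs[t]]
    return p


def brute_force_likelihood(obs, pi, A, B):
    """Equation 5 evaluated literally: a sum over all N^T state sequences."""
    n = len(pi)
    return sum(joint_likelihood(obs, q, pi, A, B)
               for q in itertools.product(range(n), repeat=len(obs)))


def viterbi(obs, pi, A, B):
    """Equation 6: the best single path, found in about N^2 T operations."""
    n = len(pi)
    delta = [pi[i] * B[i][obs[0]] for i in range(n)]
    back = []
    for o in obs[1:]:
        # Best predecessor of each state j at this time step.
        prev = [max(range(n), key=lambda i, j=j: delta[i] * A[i][j]) for j in range(n)]
        delta = [delta[prev[j]] * A[prev[j]][j] * B[j][o] for j in range(n)]
        back.append(prev)
    last = max(range(n), key=lambda i: delta[i])
    path = [last]
    for prev in reversed(back):
        path.append(prev[path[-1]])
    path.reverse()
    return delta[last], path


pi = [0.6, 0.4]
A = [[0.7, 0.3], [0.4, 0.6]]
B = [[0.9, 0.1], [0.2, 0.8]]  # B[state][symbol], a discrete-output assumption
obs = [0, 1, 1]
total = brute_force_likelihood(obs, pi, A, B)
best, path = viterbi(obs, pi, A, B)
print(total, best, path)
```

The Viterbi score can never exceed the full sum, since it retains only the largest of the terms that Equation 5 adds together. In practice the recursion is carried out on log probabilities to avoid underflow on long sequences.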

Ephraim  [81]  has  shown  that  the  difference  between  the  two  approaches  is  bounded.  The 
logarithm  of  this  last  expression,  Equation  6,  will  be  used  in  all  classification  experiments 
to score a test observation for a particular speaker model $\lambda$. For our research, a speaker
will actually be represented by 22 phoneme or subword models. Current speech recognition
techniques would create 49 phoneme models or over 3000 context-dependent triphone
models  for  unrestricted  vocabulary.  An  efficient  Viterbi  decoding  method  using  multiple 
models  and  allowing  easy  grammar  constraints  is  the  Token  Passing  algorithm  [133]. 

2.2.1  Standard  Assumptions.  Hidden Markov models currently provide the most
successful methods for automatic speech recognition. Speech is ideally suited, in some
respects, to HMM modeling since speech is “quasi-stationary,” i.e., the statistics are
unchanging over small frames of 30-70 msecs [90]. However, adequate speech recognition
performance requires tripling the feature dimensions by concatenating first and second
order regression features, indicating the basic HMM with Gaussian mixtures may
lack the capability to capture the dynamics of the observations. The need for these
transitional features can be traced to the inherent model assumptions. Many tutorial
papers can be found for the standard hidden Markov model [93, 97, 100], where the following
assumptions are required.

•  First Order Markov state process: Hidden state sequence conforms to a discrete
Markov chain stationary process:

$$P(q_t = j \mid q_{t-1} = i, q_{t-2}, \ldots, q_1, O_{t-1}, O_{t-2}, \ldots, O_1) = P(q_t = j \mid q_{t-1} = i) = a_{ij}$$

•  Observation Independence: Observations are independent of their past values:

$$p(O_t, q_t \mid q_{t-1}, \ldots, q_1, O_{t-1}, \ldots, O_1) = p(O_t, q_t \mid q_{t-1})$$

•  Current State Dependence: Observations are independent of past observations and
also of past states:

$$p(O_t, q_t \mid q_{t-1}, \ldots, q_1, O_{t-1}, \ldots, O_1) = p(O_t, q_t \mid q_{t-1}) = a_{ij}\, p(O_t \mid q_t)$$

•  Output Probability density family: Output defined by a mixture of $M$ normal
densities, the convex combination of multivariate Gaussian densities introduced above.

2.2.2  Removal  of  Output  Independence  Assumption.  Though hidden Markov
models have been the model of choice for the past decade in speech recognition, the
assumption of state-conditioned observation independence is not valid. This prompted the
development of output densities produced by other stochastic functions of the observations.
The earliest known is the Hidden Filter HMM by Poritz [92]. Instead of the simple discrete
or continuous normal output density conditioned only on state, this likelihood is
conditioned both on state and past observations. Observation frames are assumed generated by
an autoregressive source, Equation 7. The general $p$-th order autoregressive AR($p$) model
[63, 94] bases the current output on $p$ past outputs. Let an observation $O_t$ be a frame of $K$
samples such that $O_t = (x_1, x_2, \ldots, x_K)$.

$$x_t = -\sum_{j=1}^{p} a_j x_{t-j} + e_t = \hat{x}_t + e_t \qquad (7)$$

where $a_j$ is the $j$-th predictor coefficient and the process $e_t$ is typically a Gaussian white
noise process with variance $\sigma_e^2$. The term autoregression implies $x_t$ is a linear regression
on itself with $\hat{x}_t$ representing the prediction of $x_t$ at time $t$. This simple model works
particularly well for voiced speech segments [27, 90]. Using this linear relation, it is easily
seen the probability density function of a sample given past samples has the same density
as $e_t$, only shifted¹

$$p(x_t \mid x_{t-1}, \ldots, x_{t-p}) = \frac{1}{(2\pi\sigma_e^2)^{1/2}} \exp\left\{ -\frac{1}{2\sigma_e^2} (x_t - \hat{x}_t)^2 \right\}$$

Surprisingly, the unconditional probability density function for the entire frame $O_t$ has
the same functional form as the conditional sample density, since the noise process is
independent.

$$p(x_1, x_2, \ldots, x_K) = \prod_{t=1}^{K} p(x_t \mid x_{t-1}, \ldots, x_{t-p}) \qquad (8)$$

where $\sigma_e^2$ is the noise variance over the $K$ samples within a frame.
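
A minimal sketch of the frame likelihood as a product of conditional sample densities follows. It assumes the samples preceding the frame are zero, which is one common convention and not necessarily the one used in this research; the function names and test values are hypothetical.

```python
import math


def ar_predict(x, t, a):
    """Prediction of x[t] from past samples: x_hat = -sum_j a_j * x[t-j].

    Samples before the start of the frame are treated as zero (assumption).
    """
    return -sum(a[j - 1] * x[t - j] for j in range(1, len(a) + 1) if t - j >= 0)


def ar_frame_log_likelihood(x, a, noise_var):
    """Log of the product over the frame of the Gaussian residual densities."""
    ll = 0.0
    for t in range(len(x)):
        e = x[t] - ar_predict(x, t, a)  # residual e_t = x_t - x_hat_t
        ll += -0.5 * (math.log(2.0 * math.pi * noise_var) + e * e / noise_var)
    return ll


# AR(1) example with a_1 = -0.5, so that x_hat_t = 0.5 * x_{t-1}.
frame = [1.0, 0.5]
print(ar_frame_log_likelihood(frame, [-0.5], 1.0))
```

In the example the second sample is predicted exactly, so only the first residual (equal to 1.0) contributes to the exponent, giving a log-likelihood of $-\log(2\pi) - 0.5$.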

These  models  were  further  generalized  to  linear  AR  mixture  models  by  Juang  and 
Rabiner  [60]  and  later  used  within  ergodic  structures  by  Tishby  [124].  Equation  8  has 
an  efficient  form,  first  demonstrated  by  Juang  [57].  The  output  density  for  an  autore¬ 
gressive  frame  0(  =  ,  ^xk)i  for  state  i  described  by  predictor  coefficients  = 

(tti,  02)  •  •  ■ )  ttp)  and  noise  variance  <jj  is 


_^exp{-^«(0.,a,)} 


(9) 


and 

6{Ot,ai)  =  r„(0)r3,(0)  +2^ro(;)7',,(;)  (10) 

i=i 

where  6{Ot,  Oj)  can  be  considered  a  distortion  or  distance  metric  between  a  frame  Oj  and 
a  hidden  filter  a,.  The  efficiency  of  this  equation  is  that  the  frame  samples  need  not  be 
known  -  only  the  biased  autocorrelation  estimate,  rj,,  of  the  frame  and  the  autocorrelation 
of  the  filter  Va-  Equation  9  describes  a  single  mixture  of  an  HMM  state.  For  a  state  i  with 


^The dilemma we are faced with is notation. The signal processing and statistical modeling literature uses “a_j” as a predictor, autoregressive or IIR filter coefficient, while the hidden Markov literature always uses “a_ij” as a transition probability. Since the latter has little significance in this research, it should be clear from context when filters are being discussed.

M mixtures,

\[
b_i(O_t) = \sum_{m=1}^{M} c_{im}\, b_{im}(O_t) \tag{11}
\]

this architecture will attempt to model a $p$-th order filter $a_{im}$ for each state $i$ and each mixture $m$. This description defines the frame autoregressive hidden Markov model, shown graphically in Figure 2.
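The distortion in Equation 10 can be checked numerically. The following is a minimal sketch, not the author's implementation: the function names `autocorr` and `ar_distortion` are hypothetical, the inverse filter is assumed to be $[1, a_1, \ldots, a_p]$, and unnormalized autocorrelation sums are used (a biased estimate divided by $K$ would only scale the result by $1/K$).

```python
import numpy as np

def autocorr(x, max_lag):
    """Unnormalized autocorrelation sums r(j) = sum_n x[n] * x[n + j], j = 0..max_lag."""
    x = np.asarray(x, dtype=float)
    return np.array([np.dot(x[: len(x) - j], x[j:]) for j in range(max_lag + 1)])

def ar_distortion(frame, a):
    """delta(O, a) = r_a(0) r_o(0) + 2 * sum_{j=1}^{p} r_a(j) r_o(j), as in Equation 10."""
    # Assumed inverse (prediction error) filter: [1, a_1, ..., a_p]
    a_full = np.concatenate(([1.0], np.asarray(a, dtype=float)))
    p = len(a_full) - 1
    r_a = autocorr(a_full, p)   # autocorrelation of the filter
    r_o = autocorr(frame, p)    # autocorrelation of the frame
    return r_a[0] * r_o[0] + 2.0 * np.dot(r_a[1:], r_o[1:])
```

Under these assumptions the value equals the energy of the prediction residual obtained by filtering the zero-padded frame with the inverse filter, which is why only $p + 1$ autocorrelation lags of the frame are needed rather than the $K$ samples themselves.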


Figure 2. Juang’s frame autoregressive mixture extensions to the Poritz hidden filter. While Poritz proposed single filter states, Juang extended to multiple mixtures.

Definition II.2 (Frame Autoregressive Hidden Markov Model) A frame autoregressive hidden Markov model is a probabilistic function of a first order Markov state process, denoted by the triple $\lambda = (\Pi, A, B)$ where $\Pi$ is the $N \times 1$ vector of initial probabilities, $A$ is the $N \times N$ matrix of transitions and $B$ is the set of all parameters describing the conditional output state densities. These include:

• $a_{im} = (a_{im1}, a_{im2}, \ldots, a_{imp})$: $p$-th order filter coefficients for state $i$, mixture $m$

• $\sigma_{im}^2$: residual error variance for state $i$, mixture $m$

• $c_{im}$: state $i$ mixture $m$ weight

The approach taken by Kenny [66] and later Woodland [131] models the vector-valued features as a linear predictive source. Including a separate mean per state, the vector observations are assumed generated by


Ot  —  m(9<)  Qt-i)  +  ,  +Ai{qt,  +  Ef  (12) 

where $E_t$ is a multivariate white Gaussian process with covariance $\Sigma(q_t, q_{t-1})$. Note this model uses the notation of “emission on state transition”, where the quantities of interest are conditioned on the state pair, $(q_t, q_{t-1})$. Kenny applied this model to phoneme recognition and examined specific lags $l$. His results indicated no improvement over standard hidden Markov models. Woodland used the more common “emit on state” assumption with a state model of the form

Ot  =  +  Aj{Ot-j  -  p.j)  +  Ef.  (13) 

1=1 

This regression is similar to Kenny’s model, but with the added offset mean parameters, $\mu_j$. He also selects a portion of the residual space to enhance discrimination. The corresponding multivariate output density for state $i$ is given by

«0.)=p;^55^exp{-i£fE.-r£,}  (14) 

where the $T$ transformation selects the most discriminating dimensions. Woodland was able to demonstrate better performance by reducing the feature sizes (using the $T$ transform) when applied to a small “E-set”.^

To date, applications of the two previous models have focused only on linear prediction using specific lags (forward or backward), applied to phoneme recognition. Since knowledge of the most important lags is unavailable, either for speech or speaker recognition, a full autoregressive model will be examined. The multivariate conditional output density defined in Equation 14, without the mean offset $\mu_j$ and transform matrix $T$, will be defined as the vector autoregressive hidden Markov model.

^The E-set typically consists of a small subset of the English alphabet (B, C, D, E, G, P, T, V).

Definition II.3 (Vector Autoregressive Hidden Markov Model) A vector autoregressive hidden Markov model is a multivariate probabilistic function of a first order Markov state process, denoted by the triple $\lambda = (\Pi, A, B)$ where $\Pi$ is the $N \times 1$ vector of initial probabilities, $A$ is the $N \times N$ matrix of transitions and $B$ is the set of all parameters describing the conditional output state densities. These include

• $\mu_{im}$: mean vector for state $i$, mixture $m$

• $A_{im} = (A_{i1}, A_{i2}, \ldots, A_{ip})$: $p$-th order filter matrices for state $i$, mixture $m$

• $\Sigma_{im}$: multivariate noise covariance for state $i$, mixture $m$

• $c_{im}$: state $i$ mixture $m$ weight


This section described the research in linear dynamic systems, applied most often to speech recognition. The philosophy common to all these approaches is to examine the statistics of the observations within a state. Standard hidden Markov models assume features are generated as a constant state mean, with any observation errors accounted for by the covariance estimate. Hidden filters, on the other hand, account for the (prediction) error after a linear regression is applied. The next section examines the approach when linearity is removed from the state model.

2.2.3 Nonlinear Hybrid Markov Models. Several researchers have recently combined the pertinent features of HMMs and multilayer perceptrons (MLPs) or neural networks. The HMM provides an explicit discrete state model, including efficient optimization strategies for the model parameters; the neural networks provide nonlinear input-output mappings and discriminative class estimation.

The first complete treatment of HMM hybrids is the recent work by Bourlard and Morgan [13, 14]. Their presentation of HMMs is based on variations of the “local contribution”, which they define as the joint probability of the state and observation conditioned on all previous states and observations (and the current set of weights $W$):

\[
p(q_t = i, O_t \mid q_1, \ldots, q_{t-1}, O_1, \ldots, O_{t-1}, W) \tag{15}
\]

Hybrid techniques then make various simplifying assumptions or relaxations of this likelihood and attempt to approximate it through MLPs or recurrent architectures. Generalizations of the local contribution in Equation 15 can use both past and future observations. Bourlard and Morgan use a feedforward MLP to approximate this likelihood by training with the states as the desired output values.

Neural networks can also be trained to approximate both nonlinear autoregressive (NAR) and nonlinear autoregressive moving average (NARMA) models through gradient descent learning. If the stochastic inputs are unknown, as is usually the case, they may be approximated by using the residual of the previous prediction [24]. All these stochastic time series models can be extended, in theory, with a Markov structure. One such NAR/HMM hybrid approach, the “Hidden Control Neural Network”, was developed by Levin [70] and later detailed in [71]. A few enhancements and applications by other researchers have also been published [39, 121, 122] and shown to be successful.

During the past three years, the similarities between hidden Markov models and recurrent architectures have been studied. These interpretations were developed by Bridle and Kehagias [15, 64, 65] and explicitly used for phoneme recognition by Robinson [105, 106, 107, 108]. The recurrent architecture can be viewed as a non-linear state-space model. Robinson, for example, uses these networks to retain context in the hidden activation nodes. Standard HMM processing can then be integrated as a back end for hierarchical word modeling, state-duration modeling and overall word likelihood calculation. Like Bourlard and Morgan, Robinson requires specialized hardware to calculate the error gradients during training, due to the extensive amounts of training data needed for reliable speaker independent subword modeling.

2.3  General  Hidden  Filter  Framework 

Extensions to the standard Gaussian mixture HMM have been developed recently to add context and discriminative capabilities. Context has been attempted through the use of linear prediction, whereas discriminative learning is provided by feedforward MLPs or feedback recurrent networks. Other related Markov modulated sources have included noise corrupted polynomials [30] and mixed state-observation approaches [45]. Each method, to date, fits into the general framework of Markov-modulated dynamic systems or Generalized Hidden Filter Markov Models (GHFMM). The underlying state probabilistic functions may be linear or non-linear, conditional or non-conditional, and causal or noncausal. Figure 3 shows all HMM approaches in a new unified framework.

This research indicates a wide range of applicability to modeling general, possibly even chaotic, time series. Many applications requiring prediction, monitoring of dynamic systems or classification of noise corrupted observations potentially benefit through the use of GHFMMs. This research will specifically examine classification of acoustic signals, which can be considered noise corrupted observations from a particularly personal dynamic system: human speech production.

Chapter III will demonstrate that hidden filter Markov models can be applied to raw speech samples, frames of speech or processed features. The following sections in this chapter examine the typical processing of raw speech into features, which is often performed prior to speech or speaker recognition. The last section demonstrates the feature extraction procedure and then shows that these features are highly correlated.

2.4  Feature  Analysis 

Standard speech processing techniques were used to extract features from the raw samples. It should be noted that no “best” feature set has been determined for speaker recognition tasks. Since speaker modeling has such a rich history, one which parallels speech recognition, many popular features have been examined [1, 2, 4, 11, 25, 47, 53, 61, 62, 77, 114, 120]. Recent studies for open set speaker identification, on both the high quality TIMIT [56, 84] and tactical radio GREENFLAG [40, 41] databases, indicate that no one feature may prove optimal in all cases [91].

2.4.1 Signal Processing of Speech. Features are extracted using many standard signal processing techniques [99, 98, 59, 134]. The speech signal traverses many stationary points with specific spectral signatures. It is these short-time signatures which separate phones or phonemes. Accordingly, a phone is the smallest individual acoustic unit in the field of phonetics [87, 90]. In the study of descriptive linguistics, the smallest unit is the phoneme. A phoneme is that entity which must be altered to change word meaning, e.g., “bat” and “cat” differ only in their initial phoneme. Since there is much overlap between the two fields of study, this research will use Parson’s definition [90].

Figure 3. Architecture for Generalized Hidden Filter Markov Models (GHFMM).


Definition II.4 (Phoneme) A phoneme is the smallest acoustic unit in a given language that is able to change word meaning. A model of this unit will be referred to as a phoneme or monophone model.

Modeling of speaker dependencies within phoneme acoustics will be explored. As such, labeled phonetic data will be required for initial training and separate phoneme models will be created for all speakers. As will be discussed in Chapter V, testing will string together the correct phoneme models relating to the particular phrase prompted.

All  raw  data  consist  of  8  kHz  sampled  speech.  The  original  signal  conditioning  and 
acquisition  were  designed  by  Campbell  [67]  to  provide  bandwidth  and  linear  phase  up  to 
3.8 kHz. The resulting bandpass filter response models the DoD’s STU-III secure voice terminal’s input characteristics very closely.

Analysis  frames  of  20  msec  are  first  pre-emphasized  to  remove  lip  radiation  effects  by 
a  simple  high  pass  filter.  Then,  a  Hamming  window  is  applied  to  decrease  frame  edge  effects 
in  the  Fourier  transform.  Frames  are  analyzed  every  10  msec.  If  one  stops  at  this  point 
and  displays  the  magnitude  of  the  resulting  short-term  Fourier  transform,  a  spectrogram 
results  (See  Figure  4  for  an  example  YOHO  database  combination  lock  utterance). 


Figure 4.


The magnitude transform coefficients are correlated with each of 24 triangular filters spaced linearly up to 1 kHz and logarithmically thereafter (see Figure 5). On a Mel scale, the filters are spaced linearly,

\[
\mathrm{Mel}(f) = 2595 \log_{10}\left(1 + \frac{f}{700}\right).
\]

This nonlinear frequency analysis models human perception [90] and empirically improves speech recognition performance [1, 2, 3]. The logarithms of the energy outputs from these filters, denoted $m_j$, are the Mel frequency spectral coefficients. To reduce and decorrelate these 24 coefficients, a Discrete Cosine Transform is applied which reduces the features to 12 Mel frequency cepstral coefficients, $c_i$.

Figure 5. Typical Speech Processing/Feature extraction.
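The Mel warping can be illustrated with a short sketch. This assumes the common 2595/700 form of the Mel formula; the function names and the default band edges (0 Hz to 4000 Hz, the Nyquist frequency of 8 kHz sampled speech) are illustrative assumptions, not values taken from this work.

```python
import numpy as np

def hz_to_mel(f_hz):
    """Mel(f) = 2595 * log10(1 + f / 700)."""
    return 2595.0 * np.log10(1.0 + np.asarray(f_hz, dtype=float) / 700.0)

def mel_to_hz(mel):
    """Inverse of hz_to_mel."""
    return 700.0 * (10.0 ** (np.asarray(mel, dtype=float) / 2595.0) - 1.0)

def filter_centers_hz(n_filters=24, f_low=0.0, f_high=4000.0):
    """Center frequencies (Hz) of n_filters triangular filters equally spaced in Mel.

    f_low and f_high are hypothetical band edges chosen for illustration.
    """
    edges = np.linspace(hz_to_mel(f_low), hz_to_mel(f_high), n_filters + 2)
    return mel_to_hz(edges[1:-1])
```

Equal spacing in Mel gives center frequencies whose separation in Hz grows with frequency, which is the near-linear then logarithmic layout described above; 1 kHz maps to approximately 1000 Mel.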


A  raised  cosine  is  applied  to  account  for  noisy  low  and  high  order  coefficients.  This  filtering 
process,  called  liftering,  uses  the  following  weighting  for  L  =  20. 


\[
c'_i = \left(1 + \frac{L}{2} \sin\frac{\pi i}{L}\right) c_i
\]


Lastly,  to  remove  channel  effects,  removal  of  the  Mel  frequency  cepstral  time  average  is 
performed.  This  homomorphic  deconvolution  [88]  compensates  for  microphone  and  other 
long-term  recording  effects  present  in  the  signal.  The  logarithm  of  the  frame  energy  is 
appended  to  all  cepstral  vectors.  This  value  is  normalized  by  the  maximum  energy  present 
in  the  utterance.  Thus,  the  baseline  feature  contains  13  coefficients. 


2.4.2 Cepstral Characteristics. Digalakis [32] recently examined linear and nonlinear regression of the cepstral coefficients within and between phoneme segments. He concluded that, within phoneme segments, a linear regression model can explain up to 88% of the variance in predicting the next cepstral vector for most frames. Between phonemes, however, the linear model breaks down.

Figure 6 shows scatterplots for the first cepstral coefficient at various lags $l$. Each subplot presents $c_1(t)$ against $c_1(t + l)$. The inset is the calculated correlation coefficient for the data. These indicate that close frames, separated by up to 70 msec, are not statistically independent in time. Also, the scatterplots appear Gaussian through the seventh lag. The statistical independence assumption of standard hidden Markov models is therefore not valid for these features and must be removed. Digalakis suggests a linear model will be appropriate and relevant for phoneme modeling. For larger subword models (syllables, diphones, etc.) the hybrid HMM/neural approaches are possibly more suitable.
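The lag correlation reported in each scatterplot inset can be computed as below. This is a minimal sketch with a hypothetical function name; it assumes a one-dimensional sequence of a single cepstral coefficient, one value per frame, so that with frames every 10 msec a lag of 7 corresponds to 70 msec.

```python
import numpy as np

def lag_correlation(c, lag):
    """Pearson correlation between c(t) and c(t + lag) over one coefficient track."""
    c = np.asarray(c, dtype=float)
    if lag == 0:
        return 1.0
    x, y = c[:-lag], c[lag:]          # ordered pairs (c(t), c(t + lag))
    return float(np.corrcoef(x, y)[0, 1])
```

A value near zero at a given lag would be consistent with the independence assumption of the standard model at that lag; values well away from zero indicate temporal correlation that a hidden filter could exploit.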


Figure 6. Scatterplot of first cepstral coefficient $c_1(t)$ for lags 0-7. Each point within a subplot is the ordered pair $(c_1(t), c_1(t + l))$ where $l$ is the lag. Inset within each subplot is the correlation coefficient over all data.


2.4.3 Transitional Coefficients. One method of modeling transitional effects in the observations is through the use of regression coefficients, often denoted by $\Delta_t$. These coefficients are found by fitting the best straight line through a set of observations

\[
\{O_{t-W}, O_{t-W+1}, \ldots, O_t, O_{t+1}, \ldots, O_{t+W}\}
\]

and passing through the point $O_t$, using a window of width $\pm W$. The equation for this line is $y = a \cdot k + O_t$, which is a shift of the origin to time $t$ with an unknown slope, $a$. Define the squared error cost criterion to be

\[
J = \sum_{k=-W}^{W} (O_{t+k} - y)^2 = \sum_{k=-W}^{W} (O_{t+k} - ak - O_t)^2 .
\]

The value of the slope $a$ which minimizes this quadratic occurs at $dJ/da = 0$. Thus

\[
0 = \sum_{k=-W}^{W} 2\,(O_{t+k} - ak - O_t)(-k) .
\]

Dividing by $-2$ and splitting the sum into its negative, zero and positive lags,

\[
0 = \sum_{k=-W}^{-1} k\,(O_{t+k} - ak - O_t) + \sum_{k=0}^{0} k\,(O_{t+k} - ak - O_t) + \sum_{k=1}^{W} k\,(O_{t+k} - ak - O_t)
\]

and letting $l = -k$,

\[
\begin{aligned}
0 &= 0 - \sum_{l=1}^{W} l\,(O_{t-l} + al - O_t) + \sum_{k=1}^{W} k\,(O_{t+k} - ak - O_t) \\
  &= \sum_{l=1}^{W} l\,(-O_{t-l} - al + O_t) + \sum_{k=1}^{W} k\,(O_{t+k} - ak - O_t)
\end{aligned}
\]

and combining summations,

\[
0 = \sum_{k=1}^{W} k\,(O_{t+k} - O_{t-k} - 2ak) = \sum_{k=1}^{W} k\,(O_{t+k} - O_{t-k}) - 2a \sum_{k=1}^{W} k^2
\]

\[
a = \frac{\sum_{k=1}^{W} k\,(O_{t+k} - O_{t-k})}{2 \sum_{k=1}^{W} k^2} \tag{16}
\]


This linear least squared error solution for the slope is the standard regression coefficient found in calculating “Delta”, and subsequently “Delta-Delta”, coefficients in speech recognition. An approximation to this regression, called the differenced coefficient, is sometimes also used (Equation 17).

\[
\Delta_t = O_{t+W} - O_{t-W} \tag{17}
\]
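The regression coefficient of Equation 16 and the differenced coefficient of Equation 17 can be sketched as follows. The function names are hypothetical, and the sketch assumes the frame index satisfies $W \le t \le T - 1 - W$ so that the full window exists; no edge padding is attempted.

```python
import numpy as np

def delta_regression(obs, t, W):
    """Equation 16: least-squares slope of the line through O_t over a +/-W window."""
    obs = np.asarray(obs, dtype=float)
    k = np.arange(1, W + 1)
    num = sum(kk * (obs[t + kk] - obs[t - kk]) for kk in k)
    den = 2.0 * np.sum(k ** 2)
    return num / den

def delta_difference(obs, t, W):
    """Equation 17: simple difference across the window (not scaled to a slope)."""
    obs = np.asarray(obs, dtype=float)
    return obs[t + W] - obs[t - W]
```

For a sequence that is exactly linear in time, the regression coefficient returns the slope itself, while the differenced coefficient returns the slope multiplied by $2W$. Because the indexing is on the first axis, the same functions apply to a $(T, D)$ array of feature vectors.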
2.5 Conclusion


This chapter has developed a general hidden Markov model framework, and reviewed the necessary linear submodels within each state. The motivation to extend the current techniques is better acoustic modeling of phoneme context and correlation. The full potential of temporal stochastic models has not yet been applied to the speaker modeling problem. To date, published material for speaker modeling has used frame based linear prediction within an ergodic HMM structure. It will be demonstrated that ergodic models largely account for the effects due to language, rather than the speaker. To circumvent language modeling, speaker dependent phoneme hidden filter modeling is proposed.

The phoneme continues to be the popular subword unit for speech. The acoustics within a phoneme segment are relatively stationary and, as such, this research will focus on their speaker dependent modeling. This approach is inherently text-independent, since the set of all phonemes can be modeled. For experimentation, the YOHO database will be used, which constrains utterances to combination lock phrases that need only a subset of the full phoneme acoustic space. The next chapter proposes new extensions to the baseline hidden Markov model and develops their reestimation. The methods will be shown applicable to raw samples of a signal, frames of samples or a sequence of processed feature vectors.

III. Model Reestimation


3.1 Introduction

In their book, “Connectionist Speech Recognition”, Bourlard and Morgan write:

For  speech  recognition  in  particular,  it  is  important  to  improve  our  models  to 
better  take  into  account  the  dynamical  properties  of  the  speech  process.  In  this 
framework,  methods  should  be  developed  to  use  more  contextual  information 
for  classification. 

This chapter first presents the theory behind hidden Markov model reestimation, then provides the reestimation of hidden filter models. The choice of hidden filters comes naturally from the long accepted speech production model [5, 95, 99]. Speech can be grossly viewed as a source signal (either noise-like or periodic) convolved with a rational filter describing the vocal tract. While rational filter models such as autoregressive (AR) and autoregressive-moving average (ARMA) have a rich history in spectral estimation, signal prediction, speech processing and economics, their effectiveness within a Markov modulated structure for modeling speakers is as yet unknown.

This chapter develops three levels of hidden filters and provides their reestimation procedures. The first level models the sample or raw observations. While this may be the most efficient [46], it also requires extensive calculations for both reestimation and decoding, due to the amount and frequency of the data. The next level combines observations into frames. Efficiency is gained since the actual raw samples are not needed in reestimation, only the autocorrelation of the samples. Also, the frequency of reestimation is reduced substantially. The last level of modeling occurs on some processed spectral representation of these frames. Methods such as the mean-subtracted, liftered, Mel frequency cepstral vectors have been researched extensively to provide a compact, decorrelated representation of the log-spectrum. This last level of modeling reduces the frequency, number and complexity of the Baum-Welch calculations and, for this reason, it will be the primary technique applied to large speaker recognition experiments.

3.2 Hidden Markov Model Reestimation

The hidden Markov model reestimation provides maximum likelihood parameter estimates given a set of training sequences. The reestimation will be solved for a single observation sequence and is easily extended to multiple training sequences. Two probabilistic quantities, which are used throughout standard HMM reestimation, are the forward and backward variables. These take on great significance in deciding which observations get used to reestimate a particular state’s parameters. These initial calculations (the Forward-Backward algorithm), along with the final parameter updates, are collectively called the Baum-Welch algorithm.

3.2.1 Forward-Backward Variables. From [96, 112], define

\[
\alpha_t(i) = p(O_1, \ldots, O_t, q_t = i \mid \lambda)
\]

as the joint likelihood of the observations and state $q_t = i$ given the model $\lambda$.

The derivation for $\alpha_t(i)$ is inductive (see Appendix A). Two important points are that $\alpha_{t+1}(j)$ is a function of the previous $\alpha_t$

\[
\alpha_{t+1}(j) = b_j(O_{t+1}) \sum_{i=1}^{N} a_{ij}\, \alpha_t(i)
\]

and the forward variable evaluated at the last time sample provides the total likelihood of the observation sequence given the current model

\[
p(O_1 \ldots O_T \mid \lambda) = \sum_{i=1}^{N} p(O_1 \ldots O_T, q_T = i \mid \lambda) = \sum_{i=1}^{N} \alpha_T(i) .
\]

The backward algorithm is also inductive, derived in a similar manner. Let

\[
\beta_t(i) = p(O_{t+1} \ldots O_T \mid q_t = i, \lambda)
\]

where the backward variable, $\beta_t(i)$, is the likelihood of observing the partial sequence $O_{t+1} \ldots O_T$ given the current state $q_t = i$ and the model $\lambda$. The inductive calculation of $\beta_t(i)$ (see Appendix A) becomes

\[
\beta_t(i) = \sum_{j=1}^{N} a_{ij}\, b_j(O_{t+1})\, \beta_{t+1}(j) .
\]

The total likelihood can also be evaluated at $t = 1$:

\[
p(O_1 \ldots O_T \mid \lambda) = \sum_{i=1}^{N} \pi_i\, b_i(O_1)\, \beta_1(i)
\]
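The two recursions can be sketched in a few lines. This is an illustrative, unscaled version and not the implementation used in this research: it assumes the output densities have already been evaluated into an array `B` with `B[t, i]` $= b_i(O_t)$, and the function names are hypothetical. For long sequences a practical version would need scaling or log arithmetic to avoid underflow.

```python
import numpy as np

def forward(pi, A, B):
    """alpha[t, i] = p(O_1..O_t, q_t = i | lambda), with B[t, i] = b_i(O_t)."""
    T, N = B.shape
    alpha = np.zeros((T, N))
    alpha[0] = pi * B[0]
    for t in range(1, T):
        # alpha_{t}(j) = b_j(O_t) * sum_i a_ij * alpha_{t-1}(i)
        alpha[t] = B[t] * (alpha[t - 1] @ A)
    return alpha

def backward(A, B):
    """beta[t, i] = p(O_{t+1}..O_T | q_t = i, lambda), with beta at the last frame set to 1."""
    T, N = B.shape
    beta = np.ones((T, N))
    for t in range(T - 2, -1, -1):
        # beta_t(i) = sum_j a_ij * b_j(O_{t+1}) * beta_{t+1}(j)
        beta[t] = A @ (B[t + 1] * beta[t + 1])
    return beta
```

The total likelihood obtained by summing the final forward variables should agree with the value obtained from the backward variables at the first frame, which is a convenient consistency check on an implementation.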


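3.2.2 State Likelihood. Lastly, a related quantity is denoted by [95, 96]

\[
\xi_t(i, j) = p(q_{t+1} = j, q_t = i \mid O_1 \ldots O_T, \lambda) = \frac{p(q_{t+1} = j, q_t = i, O_1 \ldots O_T \mid \lambda)}{p(O_1 \ldots O_T \mid \lambda)}
\]

which can be expressed in the forward and backward quantities as

\[
\xi_t(i, j) = \frac{\alpha_t(i)\, a_{ij}\, b_j(O_{t+1})\, \beta_{t+1}(j)}{p(O_1 \ldots O_T \mid \lambda)} .
\]

The following single state likelihood is most useful in practice, often denoted by $\gamma_t(i)$.

\[
\gamma_t(i) = \sum_{j=1}^{N} \xi_t(i, j) = p(q_t = i \mid O_1 \ldots O_T, \lambda)
\]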
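Both posteriors follow directly from the forward and backward variables. The sketch below is self-contained and unscaled, with a hypothetical function name; as an assumption of the sketch, `B[t, i]` holds the precomputed density $b_i(O_t)$, and $\gamma$ is computed from $\alpha_t(i)\beta_t(i)$ so that it is also defined at the final frame.

```python
import numpy as np

def state_posteriors(pi, A, B):
    """Return gamma[t, i] = p(q_t = i | O) and xi[t, i, j] = p(q_t = i, q_{t+1} = j | O)."""
    T, N = B.shape
    alpha = np.zeros((T, N))
    beta = np.ones((T, N))
    alpha[0] = pi * B[0]
    for t in range(1, T):
        alpha[t] = B[t] * (alpha[t - 1] @ A)
    for t in range(T - 2, -1, -1):
        beta[t] = A @ (B[t + 1] * beta[t + 1])
    total = alpha[-1].sum()                      # p(O_1..O_T | lambda)
    # xi_t(i, j) = alpha_t(i) a_ij b_j(O_{t+1}) beta_{t+1}(j) / p(O | lambda)
    xi = alpha[:-1, :, None] * A[None, :, :] * (B[1:] * beta[1:])[:, None, :] / total
    gamma = alpha * beta / total
    return gamma, xi
```

Summing `xi` over its last axis recovers `gamma` for all but the final frame, and each row of `gamma` sums to one.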


3.2.3 Baum Auxiliary Function. The goal of the training phase for HMMs is to model a set of observations with a maximum likelihood set of parameters representing the underlying Markov process and the probabilistic function of that process. Denote the model by $\lambda = (\Pi, A, B)$. Given a set of observations, $(O_1, O_2, \ldots, O_T)$, search over all $\lambda \in \Lambda$ to maximize the likelihood, $p(O_1, O_2, \ldots, O_T \mid \lambda)$. Brute force approaches would search for critical points of this likelihood such that various probabilistic constraints of $\lambda$ are satisfied, often using Lagrange techniques [100].

A better approach is the Expectation Maximization (EM) algorithm of Dempster [28], developed for maximum likelihood estimation with missing data. The missing data for the HMM problem is the unknown (hidden) state sequence. The EM algorithm solves for the maximum likelihood model by first defining the Auxiliary Function, $Q(\lambda, \bar{\lambda})$, which is a function of both the current model $\lambda$ and a re-estimated model $\bar{\lambda} = (\bar{\Pi}, \bar{A}, \bar{B})$.

\[
Q(\lambda, \bar{\lambda}) = \sum_{q_1, q_2, \ldots, q_T} p(O_1 \ldots O_T, q_1, \ldots, q_T \mid \lambda) \log p(O_1 \ldots O_T, q_1, \ldots, q_T \mid \bar{\lambda}) \tag{18}
\]

The properties which make this optimization procedure so attractive are the following:

• If $Q(\lambda, \bar{\lambda}) \ge Q(\lambda, \lambda)$ then $p(O_1 \ldots O_T \mid \bar{\lambda}) \ge p(O_1 \ldots O_T \mid \lambda)$

• For a broad class of models, $Q$ has a single global maximum; this is true for a single normal density.

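First, $p(O_1 \ldots O_T \mid \lambda)$ is usually written as [100]

\[
\begin{aligned}
p(O_1 \ldots O_T \mid \lambda) &= \sum_{q_1, q_2, \ldots, q_T} p(O_1 \ldots O_T \mid q_1, \ldots, q_T, \lambda)\, p(q_1, \ldots, q_T \mid \lambda) \\
&= \sum_{q_1, q_2, \ldots, q_T} \pi_{q_1} b_{q_1}(O_1)\, a_{q_1 q_2} b_{q_2}(O_2) \cdots a_{q_{T-1} q_T} b_{q_T}(O_T).
\end{aligned}
\]

Expanding the joint likelihood from Equation 18,

\[
\begin{aligned}
\log p(O_1 \ldots O_T, q_1, \ldots, q_T \mid \bar{\lambda}) &= \log\left[\bar{\pi}_{q_1} \bar{b}_{q_1}(O_1)\, \bar{a}_{q_1 q_2} \bar{b}_{q_2}(O_2) \cdots \bar{a}_{q_{T-1} q_T} \bar{b}_{q_T}(O_T)\right] \\
&= \log \bar{\pi}_{q_1} + \sum_{t=1}^{T-1} \log \bar{a}_{q_t q_{t+1}} + \sum_{t=1}^{T} \log \bar{b}_{q_t}(O_t)
\end{aligned}
\]

then,

\[
\begin{aligned}
Q(\lambda, \bar{\lambda}) &= \sum_{q_1, q_2, \ldots, q_T} p(O_1 \ldots O_T, q_1, \ldots, q_T \mid \lambda) \log \bar{\pi}_{q_1} \\
&\quad + \sum_{q_1, q_2, \ldots, q_T} p(O_1 \ldots O_T, q_1, \ldots, q_T \mid \lambda) \sum_{t=1}^{T-1} \log \bar{a}_{q_t q_{t+1}} \\
&\quad + \sum_{q_1, q_2, \ldots, q_T} p(O_1 \ldots O_T, q_1, \ldots, q_T \mid \lambda) \sum_{t=1}^{T} \log \bar{b}_{q_t}(O_t)
\end{aligned}
\]

and by defining subfunctions,

\[
Q(\lambda, \bar{\lambda}) = Q_\Pi(\lambda, \bar{\Pi}) + Q_A(\lambda, \bar{A}) + \sum_{i=1}^{N} Q_b(\lambda, \bar{B}_i)
\]

where the Kronecker delta function, $\delta(\cdot)$, can be used to sample a particular state.

\[
\begin{aligned}
Q_\Pi(\lambda, \bar{\Pi}) &= \sum_{q_1, q_2, \ldots, q_T} p(O_1 \ldots O_T, q_1, \ldots, q_T \mid \lambda) \log \bar{\pi}_{q_1} \\
&= \sum_{q_1, q_2, \ldots, q_T} p(O_1 \ldots O_T, q_1, \ldots, q_T \mid \lambda) \sum_{i=1}^{N} \log \bar{\pi}_i\, \delta(q_1 - i) \\
&= \sum_{i=1}^{N} p(O_1 \ldots O_T, q_1 = i \mid \lambda) \log \bar{\pi}_i
\end{aligned}
\]

\[
\begin{aligned}
Q_A(\lambda, \bar{A}) &= \sum_{q_1, q_2, \ldots, q_T} p(O_1 \ldots O_T, q_1, \ldots, q_T \mid \lambda) \sum_{t=1}^{T-1} \log \bar{a}_{q_t q_{t+1}} \\
&= \sum_{q_1, q_2, \ldots, q_T} p(O_1 \ldots O_T, q_1, \ldots, q_T \mid \lambda) \sum_{t=1}^{T-1} \sum_{i=1}^{N} \sum_{j=1}^{N} \log \bar{a}_{ij}\, \delta(q_t - i)\, \delta(q_{t+1} - j) \\
&= \sum_{t=1}^{T-1} \sum_{i=1}^{N} \sum_{j=1}^{N} p(O_1 \ldots O_T, q_{t+1} = j, q_t = i \mid \lambda) \log \bar{a}_{ij}
\end{aligned}
\]

\[
\begin{aligned}
Q_b(\lambda, \bar{B}_i) &= \sum_{q_1, q_2, \ldots, q_T} p(O_1 \ldots O_T, q_1, \ldots, q_T \mid \lambda) \sum_{t=1}^{T} \log \bar{b}_i(O_t)\, \delta(q_t - i) \\
&= \sum_{t=1}^{T} p(O_1 \ldots O_T, q_t = i \mid \lambda) \log \bar{b}_i(O_t)
\end{aligned}
\]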


It is readily noted that the auxiliary function can be maximized individually for $\bar{A}$, $\bar{\Pi}$ and the output density parameters contained in $\bar{B}$. Scaling the Q-function by $p(O_1 \ldots O_T \mid \lambda)$ results in the following (shown for $Q_b$ only), where $\gamma_t(i)$ is a product of the Forward-Backward algorithm.

\[
\begin{aligned}
Q_b(\lambda, \bar{B}_i) &= \sum_{t=1}^{T} p(O_1 \ldots O_T, q_t = i \mid \lambda) \log \bar{b}_i(O_t) \,/\, p(O_1 \ldots O_T \mid \lambda) \\
&= \sum_{t=1}^{T} p(q_t = i \mid O_1 \ldots O_T, \lambda) \log \bar{b}_i(O_t) = \sum_{t=1}^{T} \gamma_t(i) \log \bar{b}_i(O_t)
\end{aligned} \tag{19}
\]

The update equations are found by examining critical points of this scaled Q-function. The solution for the output density parameters of a single normal yields


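\[
\bar{\mu}_j = \frac{\sum_{t=1}^{T} \sum_{i=1}^{N} \alpha_{t-1}(i)\, a_{ij}\, b_j(O_t)\, \beta_t(j)\, O_t}{\sum_{t=1}^{T} \sum_{i=1}^{N} \alpha_{t-1}(i)\, a_{ij}\, b_j(O_t)\, \beta_t(j)} = \frac{\sum_{t=1}^{T} \gamma_t(j)\, O_t}{\sum_{t=1}^{T} \gamma_t(j)} \tag{20}
\]

and similarly,

\[
\bar{\sigma}_j^2 = \frac{\sum_{t=1}^{T} \sum_{i=1}^{N} \alpha_{t-1}(i)\, a_{ij}\, b_j(O_t)\, \beta_t(j)\, (O_t - \bar{\mu}_j)^2}{\sum_{t=1}^{T} \sum_{i=1}^{N} \alpha_{t-1}(i)\, a_{ij}\, b_j(O_t)\, \beta_t(j)} = \frac{\sum_{t=1}^{T} \gamma_t(j)\, (O_t - \bar{\mu}_j)^2}{\sum_{t=1}^{T} \gamma_t(j)} \tag{21}
\]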


Note that when the number of states equals one, then $\gamma_t(1) = 1$ for all $t$, and Equations 20 and 21 are the maximum likelihood estimates for the mean and variance of a random sample. Note also that these equations are all functions of the Forward-Backward variables, which in turn are derived from the current model $\lambda$. While this holds for single Gaussian densities, multiple mixtures may be estimated, providing a richer statistical model for each state.
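Written in terms of the state likelihoods, Equations 20 and 21 are simply weighted sample moments. The sketch below assumes scalar observations and a precomputed weight sequence for one state; the function name is hypothetical.

```python
import numpy as np

def reestimate_gaussian(gamma_j, obs):
    """Gamma-weighted mean and variance for one state (scalar observations)."""
    gamma_j = np.asarray(gamma_j, dtype=float)
    obs = np.asarray(obs, dtype=float)
    w = gamma_j / gamma_j.sum()            # normalized state likelihoods
    mean = np.dot(w, obs)                  # Equation 20
    var = np.dot(w, (obs - mean) ** 2)     # Equation 21
    return mean, var
```

With all weights equal to one, the result reduces to the ordinary sample mean and the maximum likelihood (divide by $T$) sample variance, matching the single-state remark above.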


This section has presented the standard Baum-Welch algorithm for the output density parameters, $\mu_i$, $\sigma_i^2$ and their mixture extensions. We have purposely not examined the transition matrix or the initial state probabilities, because none of the further techniques and models change their reestimation. The derivation can be found in Rabiner [95, 96]. The remaining sections of this chapter fully examine the assumptions of the output density functions, $b_i(O_t)$.


3.3  Hidden  Filter  Markov  Model  Reestimation 

Standard Markov models describe observations as noisy realizations of a constant signal for each state. This research examines models describing linear dynamic systems for each state. These hidden filters may be applied at various levels, based on the nature of the dynamics. The first level applies to the actual samples themselves. For voiced and some unvoiced speech signals, based on rational polynomial source models, this appears quite appropriate. Also, some glottal-stop consonants last for only a few milliseconds, shorter than the typical frame length.


3.3.1 Yule-Walker Approach. The Yule-Walker equations, known also as the Wiener-Hopf or normal equations, provide the maximum likelihood estimate (MLE) for the predictor coefficients assuming a random scalar process generated by

\[
O_t = -\sum_{k=1}^{p} a_k O_{t-k} + e_t \tag{22}
\]

where $a_k$ is the $k$-th autoregressive or predictor coefficient and $e_t$ is an innovations sequence, assumed to be a white noise process with zero mean and variance $\sigma^2$. The solution for the filter coefficients is a set of linear equations given by


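\[
-\begin{bmatrix} r_o(1) \\ r_o(2) \\ \vdots \\ r_o(p) \end{bmatrix}
=
\begin{bmatrix}
r_o(0) & r_o(1) & \cdots & r_o(p-1) \\
r_o(-1) & r_o(0) & \cdots & r_o(p-2) \\
\vdots & \vdots & \ddots & \vdots \\
r_o(-p+1) & r_o(-p+2) & \cdots & r_o(0)
\end{bmatrix}
\begin{bmatrix} a_1 \\ a_2 \\ \vdots \\ a_p \end{bmatrix}
\tag{23}
\]

which uses the biased autocorrelation estimate for a frame of $K$ samples,

\[
r_o(i) = \frac{1}{K} \sum_{j=1}^{K-i} O_j\, O_{j+i} .
\]

The maximum likelihood noise variance is obtained by using the MLE filter coefficients.

\[
\sigma^2 = r_o(0) + \sum_{k=1}^{p} a_k\, r_o(k)
\]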

Several  variations  of  these  equations  exist  (such  as  covariance,  modified  covariance,  or 
Burg)  which  make  assumptions  concerning  data  outside  the  frame  boundary  or  use  of  data 
within  the  frame  [27,  63,  123].  This  set  of  equations  will  have  a  similar  counterpart  for 
each  individual  state  of  an  HMM. 
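The autocorrelation form of these equations can be sketched as follows, assuming the sign convention of Equation 22 (the predictor terms enter with a minus sign), so the right-hand side of the linear system is the negated autocorrelation vector. The function names are hypothetical, and a direct solve is used for clarity rather than the Levinson-Durbin recursion commonly used in practice.

```python
import numpy as np

def biased_autocorr(x, max_lag):
    """Biased estimate r(i) = (1/K) * sum_j x[j] * x[j + i], i = 0..max_lag."""
    x = np.asarray(x, dtype=float)
    K = len(x)
    return np.array([np.dot(x[: K - i], x[i:]) / K for i in range(max_lag + 1)])

def yule_walker(x, p):
    """Solve R a = -r for the predictor coefficients, then the noise variance."""
    r = biased_autocorr(x, p)
    # Symmetric Toeplitz matrix with entries r(|l - k|)
    R = np.array([[r[abs(l - k)] for k in range(p)] for l in range(p)])
    a = np.linalg.solve(R, -r[1:])
    sigma2 = r[0] + np.dot(a, r[1:])
    return a, sigma2
```

For a long realization of a stable AR process with known coefficients, the estimates returned by `yule_walker` should approach those coefficients and the variance of the driving noise.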
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3.3.2  Reestimation  of  Zero  Mean  AR  Filters.  This  section  details  the  reestima¬ 
tion  of  hidden  filter  Markov  models  using  zero-mean  observations,  with  no  framing.  The 
reestimation  assumes  each  state  is  described  by  a  zero  mean  autoregression  of  the  form 

p 

Ot  =  —  ^2  O'ikOt-k  +  Ct) 

*=1 

where  e*  ~  Ar(0,  cr?).  The  logarithm  of  the  output  density  for  each  Markov  state  i  is  given 

P 

logbi{Ot)  =  -(l/2)log27r-  (l/2)log(7f  -  -h  ^ (24) 

*  k=l 

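Solving for the gradient of the auxiliary function, which equals zero at a critical point (see
Equation 19),

$$\frac{\partial Q_b(\lambda, B_i)}{\partial a_{il}} = -\sum_{t=1}^{T-1}\gamma_t(i)\,\frac{1}{\sigma_i^2}\Big[\Big(O_t + \sum_{k=1}^{p} a_{ik}O_{t-k}\Big)O_{t-l}\Big] = 0.$$

Typical of linear systems, we solve a set of $p$ simultaneous equations for the $a_{ik}$,

$$-\sum_{t=1}^{T-1}\gamma_t(i)\,O_tO_{t-l} = \sum_{t=1}^{T-1}\gamma_t(i)\sum_{k=1}^{p} a_{ik}O_{t-k}O_{t-l}, \quad \forall\, l = (1, 2, \ldots, p) \tag{25}$$

which is reminiscent of the autocorrelation method, weighted by the state likelihood $\gamma_t(i)$.
Solving these equations provides the maximum likelihood estimate of the filter
coefficients for each state. The noise variance $\sigma_i^2$ is then solved using these values of $\hat a_{ik}$,

$$\hat\sigma_i^2 = \sum_{t=1}^{T-1}\gamma_t(i)\Big(O_t + \sum_{k=1}^{p}\hat a_{ik}O_{t-k}\Big)^2 \Big/ \sum_{t=1}^{T-1}\gamma_t(i). \tag{26}$$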
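The weighted normal equations of Equation 25 and the variance update of Equation 26 can be
sketched for a single state as follows. This is an illustration under stated assumptions, not
the implementation used in this research: the helper names are hypothetical, the state
likelihoods `gamma_i` are taken as given (they would come from the forward-backward
procedure), and the sums start at the first sample for which all $p$ past values exist.

```python
def solve_linear(M, v):
    """Solve M x = v by Gaussian elimination with partial pivoting."""
    n = len(v)
    aug = [list(M[i]) + [v[i]] for i in range(n)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda row: abs(aug[row][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for row in range(col + 1, n):
            factor = aug[row][col] / aug[col][col]
            for c in range(col, n + 1):
                aug[row][c] -= factor * aug[col][c]
    x = [0.0] * n
    for i in range(n - 1, -1, -1):
        tail = sum(aug[i][c] * x[c] for c in range(i + 1, n))
        x[i] = (aug[i][n] - tail) / aug[i][i]
    return x


def reestimate_state_ar(obs, gamma_i, p):
    """Weighted normal equations (Eq. 25) and noise variance (Eq. 26) for one state.

    obs[t] is the sample O_t and gamma_i[t] the state likelihood gamma_t(i).
    Sums run over t = p .. len(obs)-1 so every O_{t-k} exists (sketch assumption).
    """
    ts = range(p, len(obs))
    M = [[sum(gamma_i[t] * obs[t - k] * obs[t - l] for t in ts)
          for k in range(1, p + 1)] for l in range(1, p + 1)]
    v = [-sum(gamma_i[t] * obs[t] * obs[t - l] for t in ts)
         for l in range(1, p + 1)]
    a = solve_linear(M, v)
    resid = [obs[t] + sum(a[k - 1] * obs[t - k] for k in range(1, p + 1)) for t in ts]
    weight = sum(gamma_i[t] for t in ts)
    var = sum(gamma_i[t] * e * e for t, e in zip(ts, resid)) / weight
    return a, var


# Noise-free AR(2) sequence O_t = -0.5 O_{t-1} + 0.3 O_{t-2}, all weight on one state.
obs = [1.0, 2.0]
for _ in range(18):
    obs.append(-0.5 * obs[-1] + 0.3 * obs[-2])
a_hat, var_hat = reestimate_state_ar(obs, [1.0] * len(obs), 2)
```

In the noise-free case the estimate reproduces the generating coefficients (0.5, -0.3) with
zero residual variance; with soft likelihoods each sample simply contributes in proportion to
$\gamma_t(i)$.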

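If the same noise is present, or assumed present, across all $N$ states, then, since
$\sum_{i}\gamma_t(i) = 1$ for every $t$,

$$\hat\sigma^2 = \sum_{i=1}^{N}\sum_{t=1}^{T-1}\gamma_t(i)\Big(O_t + \sum_{k=1}^{p}\hat a_{ik}O_{t-k}\Big)^2 \Big/ \sum_{i=1}^{N}\sum_{t=1}^{T-1}\gamma_t(i) = \frac{1}{T-1}\sum_{i=1}^{N}\sum_{t=1}^{T-1}\gamma_t(i)\Big(O_t + \sum_{k=1}^{p}\hat a_{ik}O_{t-k}\Big)^2.$$

3.3.3 Reestimation of Non-Zero Mean AR Filters. Applying an approach similar to that
used by Kenny [66] on a vector process, instead let the sample observations have a state
dependent bias, $\mu_i$. The logarithm of the output density becomes

$$\log b_i(O_t) = -\tfrac{1}{2}\log 2\pi - \tfrac{1}{2}\log\sigma_i^2 - \frac{1}{2\sigma_i^2}\Big(O_t - \mu_i + \sum_{k=1}^{p} a_{ik}O_{t-k}\Big)^2. \tag{27}$$

Setting the gradients of $Q$ with respect to both $a_{il}$ and $\mu_i$ to zero at the critical values yields

$$\frac{\partial Q_b(\lambda, B_i)}{\partial a_{il}} = -\frac{1}{\sigma_i^2}\sum_{t=1}^{T-1}\gamma_t(i)\Big[\Big(O_t - \mu_i + \sum_{k=1}^{p} a_{ik}O_{t-k}\Big)O_{t-l}\Big] = 0$$

$$\frac{\partial Q_b(\lambda, B_i)}{\partial \mu_i} = \frac{1}{\sigma_i^2}\sum_{t=1}^{T-1}\gamma_t(i)\Big(O_t - \mu_i + \sum_{k=1}^{p} a_{ik}O_{t-k}\Big) = 0
$$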


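or, shown in vector-matrix notation,

$$\sum_{t=1}^{T-1}\gamma_t(i)\begin{bmatrix} O_tO_{t-1}\\ O_tO_{t-2}\\ \vdots\\ O_tO_{t-p}\\ O_t\end{bmatrix} = \sum_{t=1}^{T-1}\gamma_t(i)\begin{bmatrix} O_{t-1}O_{t-1} & O_{t-2}O_{t-1} & \cdots & O_{t-p}O_{t-1} & O_{t-1}\\ O_{t-1}O_{t-2} & O_{t-2}O_{t-2} & \cdots & O_{t-p}O_{t-2} & O_{t-2}\\ \vdots & \vdots & \ddots & \vdots & \vdots\\ O_{t-1}O_{t-p} & O_{t-2}O_{t-p} & \cdots & O_{t-p}O_{t-p} & O_{t-p}\\ O_{t-1} & O_{t-2} & \cdots & O_{t-p} & 1\end{bmatrix}\begin{bmatrix} -a_{i1}\\ -a_{i2}\\ \vdots\\ -a_{ip}\\ \mu_i\end{bmatrix}.$$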

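The noise variance $\sigma_i^2$ is then estimated using the maximum likelihood values of $\hat\mu_i$ and
$\hat a_{ik}$,

$$\hat\sigma_i^2 = \sum_{t=1}^{T-1}\gamma_t(i)\Big(O_t - \hat\mu_i + \sum_{k=1}^{p}\hat a_{ik}O_{t-k}\Big)^2 \Big/ \sum_{t=1}^{T-1}\gamma_t(i). \tag{28}$$

For a model which uses the same driving statistics across all states [31], setting
$\partial Q_b(\lambda, B_i)/\partial\sigma^2 = 0$ at $\hat a_{ik}$, $\hat\mu_i$ gives

$$\hat\sigma^2 = \sum_{i=1}^{N}\sum_{t=1}^{T-1}\gamma_t(i)\Big(O_t - \hat\mu_i + \sum_{k=1}^{p}\hat a_{ik}O_{t-k}\Big)^2 \Big/ \sum_{i=1}^{N}\sum_{t=1}^{T-1}\gamma_t(i) = \frac{1}{T-1}\sum_{i=1}^{N}\sum_{t=1}^{T-1}\gamma_t(i)\Big(O_t - \hat\mu_i + \sum_{k=1}^{p}\hat a_{ik}O_{t-k}\Big)^2.$$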

3.3.4 AR Proof of Concept Trial. This subsection examines the ability to estimate
two autoregressive filters which switch ergodically according to a Markov process.
The forward-backward procedure determines the likelihood, $\gamma_t(i)$, that each observation
was generated by a particular hidden state (i.e. filter). The new filters are reestimated
by the weighted autocorrelation given in Equation 25 with the noise variance given by
Equation 26. Figure 8 demonstrates the ability to recover the underlying, hidden state
sequence by applying the maximum operator to the process $\gamma_t(i)$.

The test sequence contains 500 samples, shown in Figure 7. The ergodic Markov
transition matrix has $A(1,1) = A(2,2) = 0.9$. The original model parameters are

$$A_1(z) = (1, .05, .80),\ \sigma_1 = 3.00, \qquad A_2(z) = (1, .20, -.50),\ \sigma_2 = 2.00$$

with the final estimates given after eight Baum-Welch iterations,

$$\hat A_1(z) = (1, -.02, .78),\ \hat\sigma_1 = 3.00, \qquad \hat A_2(z) = (1, .21, -.53),\ \hat\sigma_2 = 2.10.$$
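A source of the kind used in this trial can be simulated with a few lines of Python. The
sketch below is only an illustration: the function name, the seed, the zero initial
conditions and the initial state are assumptions, and the $\sigma$ values are treated as noise
standard deviations, which the text does not state explicitly.

```python
import random


def simulate_markov_ar(n_samples, stay_prob, filters, sigmas, seed=0):
    """Draw samples from an AR source whose filter switches with a two-state Markov chain.

    filters[i] = (1, a_1, ..., a_p) for state i, with O_t = -sum_k a_k O_{t-k} + e_t.
    sigmas[i] is treated here as the noise standard deviation (an assumption).
    """
    rng = random.Random(seed)
    p = len(filters[0]) - 1
    state = 0                     # assumed initial state
    history = [0.0] * p           # assumed zero initial conditions
    samples, states = [], []
    for _ in range(n_samples):
        if rng.random() > stay_prob:
            state = 1 - state     # leave the current state with probability 1 - stay_prob
        a = filters[state]
        value = rng.gauss(0.0, sigmas[state])
        value -= sum(a[k] * history[-k] for k in range(1, p + 1))
        history.append(value)
        samples.append(value)
        states.append(state)
    return samples, states


samples, states = simulate_markov_ar(
    500, 0.9, [(1.0, 0.05, 0.80), (1.0, 0.20, -0.50)], [3.00, 2.00])
```

The returned `states` list plays the role of the actual state sequence against which the
maximum of $\gamma_t(i)$ would be compared.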


This  is  the  first  known  application  of  uncovering  a  hidden  state  sequence  for  a  hidden 
filter  Markov  model,  as  well  as  the  ability  to  estimate  filters  with  state  dependent  noise 
variances.  The  next  section  extends  autoregressive  sample-based  hidden  filter  modeling  to 
a  more  general,  robust,  autoregressive  moving-average  (pole-zero)  filter. 

3.3.5 Reestimation of MA and ARMA Filters. Filters other than the all-pole
autoregressive filter can also be Markov-modulated. Linear prediction on speech samples has
long been an effective representation for voiced speech sounds [27, 63, 78]. However, for
many phonemes, especially nasals and unvoiced fricatives, a moving average (MA)
component is more appropriate [37, 90]. Bourlard and Morgan strongly justify the use of
autoregressive models, which are well suited to dynamic systems. While very applicable
to speech, a better model would be autoregressive moving average (ARMA) [94]. Kay [63]
observes,

Since nearly all data are corrupted by some amount of observation noise, the
ARMA model is nearly always the appropriate one.

Figure 7. Sample AR(2) Markov-Modulated Source with Actual State Sequence.

A linear autoregressive moving average ARMA(p,q) model is defined as

$$O_t = -\sum_{i=1}^{p} a_i O_{t-i} + \sum_{j=0}^{q} b_j e_{t-j}, \qquad b_0 = 1,$$

where $p$ and $q$ represent the orders of the autoregressive (AR) and moving average (MA)
parts, respectively, and $e_t$ is often assumed to be white Gaussian noise. This model reflects a white
noise input to a pole-zero filter, with transform

$$H(z) = \frac{B(z)}{A(z)} = \frac{\sigma\,(1 + b_1z^{-1} + b_2z^{-2} + \cdots + b_qz^{-q})}{1 + a_1z^{-1} + a_2z^{-2} + \cdots + a_pz^{-p}}. \tag{29}$$

The estimation of ARMA(p,q) models involves solving a set of highly nonlinear equations,
thus only efficient suboptimal techniques exist. Durbin's approach [63, 123] models
the $A(z)$ and $B(z)$ filters separately, first solving for the maximum likelihood estimate of
the AR(p) process, then applying the filter to create an approximate MA process.
The method is considered an approximate maximum likelihood estimator (MLE) for the ARMA
coefficients.

Figure 8. Uncovering the AR Hidden State Sequence. Shown are the actual state log
likelihood $\gamma_t(2)$ and the most likely state sequence $\max_i \gamma_t(i)$ for the process
described in Figure 7.

First, the $A(z)$ filter from Equation 29 is estimated using Equations 25 and 26. Then,
a new approximate MA process is created by filtering the original signal $O_t$ with the
maximum likelihood state filters,

$$\tilde O_t = O_t + \sum_{j=1}^{p} \hat a_j O_{t-j}. \tag{30}$$

The Durbin approximation for MA filter estimation involves the following assumption, which
uses a large AR model of order $L$ to approximate the MA coefficients,

$$B(z) = \sum_{j=0}^{q} b_j z^{-j} = \frac{1}{A_\infty(z)} \approx \frac{1}{A_L(z)}.$$

All previous Markov-modulated AR reestimation (see Section 3.3.2) then applies to
this $L$-th order AR approximation. Approximate MLE estimates of $B(z)$ use the
autocorrelation method of model order $q$, where the $A_L(z)$ coefficients $(1, \hat a_1, \ldots, \hat a_L)$ are
treated as "data" [63].
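The second stage of Durbin's method, in which the long AR coefficients are treated as data,
can be sketched as below. This is a hedged illustration rather than the procedure as
implemented in this research: the helper names are hypothetical, the long AR model is
supplied directly instead of being estimated from a signal, and the Levinson-Durbin
recursion stands in for the autocorrelation method.

```python
def autocorr(x, max_lag):
    """Biased autocorrelation of the sequence x up to max_lag."""
    n = len(x)
    return [sum(x[j] * x[j + i] for j in range(n - i)) / n
            for i in range(max_lag + 1)]


def levinson(r, order):
    """Levinson-Durbin solution of the normal equations; returns [1, c_1, ..., c_order]."""
    c = [1.0]
    err = r[0]
    for m in range(1, order + 1):
        acc = sum(c[k] * r[m - k] for k in range(m))
        refl = -acc / err
        prev = c + [0.0]
        c = [prev[k] + refl * prev[m - k] for k in range(m + 1)]
        err *= 1.0 - refl * refl
    return c


def durbin_ma(long_ar, q):
    """Estimate B(z) of order q by treating the long AR coefficients as data."""
    return levinson(autocorr(long_ar, q), q)


# Long AR model of 1/B(z) for B(z) = 1 + 0.5 z^-1: coefficients (-0.5)^n, n = 0..L.
L = 30
long_ar = [(-0.5) ** n for n in range(L + 1)]
b_hat = durbin_ma(long_ar, 1)
```

The idea is that the coefficients of $A_L(z) \approx 1/B(z)$ form the impulse response of an
all-pole filter whose denominator is $B(z)$, so an order-$q$ autocorrelation fit to that
sequence returns an estimate of the MA coefficients.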

3.3.6 Proof of Concept Trial. An examination of an ARMA Markov-modulated
process shows the ability to estimate rational hidden filters. The 1000 samples were
generated by two ARMA(2,2) filters with Markov transition probabilities of $A(1,1) = A(2,2) = 0.9995$
(Figure 9). Following an initial uniform segmentation, eight Baum-Welch iterations
produced the state likelihoods shown in Figure 10. Various large AR approximations
($L$ = 10, 20 and 30) to the MA filter were successful. Note that for this example, the AR
process was not Markov modulated and could be estimated directly. The original rational
filters were

$$H_1(z) = \frac{.5(1.00 + 0.50z^{-1} + 0.30z^{-2})}{1.00 - 1.00z^{-1} + 0.30z^{-2}}, \qquad H_2(z) = \frac{.5(1.00 - 0.40z^{-1} + 0.20z^{-2})}{1.00 - 1.00z^{-1} + 0.30z^{-2}}.$$

Figure 9. Markov-Modulated ARMA(2,2) Process

Figure 10. Uncovering the ARMA Hidden State Sequence. Shown are the actual
likelihoods $\gamma_t(2)$ and the most likely state sequence $\max_i \gamma_t(i)$ for the process
described in Figure 9.

Table  1  shows  final  ARMA  filters  and  noise  variance  estimates.  A{z)  was  estimated  to  be 

A(z)  =  1.00  -  IMz-^  -H  0.30Z-2. 


Table  1.  Estimated  Markov-modulated  ARMA  Filters. 


$B(z) \approx 1/A_L(z)$

$B_2(z)$ filter

$B_1(z)$ filter

1.00,  -0.43,  0.13 

0.6975 

0.5623 

1.00,  -0.47,  0.15 

0.6876 

0.5492 

0.6783 

0.5298 

This example demonstrated the ability to find and estimate pole-zero filters which are
generated by a hidden Markov process. It has been demonstrated that for certain speech
phonemes, ARMA is the model of choice. However, only approximate MLE methods exist
for their solution, and these methods involve filtering the sequence with estimated filters.
The next section returns to the autoregressive model, but this time on frames of observations.

3.4 Frame Autoregressive Hidden Filter Reestimation

This section examines the reestimation of filters when applied to frames of observations.
We believe this technique has much merit, especially when trained and applied in
a new architecture to model speaker dependent phonemes. The following derivations are
expansions and clarifications from [29, 57, 58, 60, 72, 92, 93, 96]. First, assume a single
hidden filter for each state modeling frames of observations. For any autoregressive
observation, Juang [57, 60] defines the output density of the frame, $O_t = (x_1, x_2, \ldots, x_K)$, when
the observation sequence length $K$ is much greater than the autoregressive order, as

$$p(x_1, \ldots, x_K \mid \sigma, a_i) = (2\pi\sigma^2)^{-K/2}\exp\Big\{-\frac{1}{2\sigma^2}\,\alpha(x_1, \ldots, x_K; a_i)\Big\}. \tag{31}$$

The gain-independent density, where $(s_1, \ldots, s_K) = (x_1/\sigma, \ldots, x_K/\sigma)$, is simply

$$p(s_1, \ldots, s_K \mid a_i) = (2\pi)^{-K/2}\exp\Big\{-\frac{1}{2}\,\alpha(s_1, \ldots, s_K; a_i)\Big\}. \tag{32}$$

Juang uses a total prediction error in the form expressed by

$$\alpha(x_1, \ldots, x_K; a) = r_a(0)r_x(0) + 2\sum_{i=1}^{p} r_a(i)r_x(i) \tag{33}$$

and the autocorrelations are further defined as

$$r_a(i) = \sum_{j=0}^{p-i} a_j a_{j+i}, \qquad r_x(i) = \sum_{j=1}^{K-i} x_j x_{j+i}.$$

This derivation assumes the driving error was a zero mean white process, normally
distributed.

Kay [63] defines a similar density (after the first $p$ samples) as

$$p(x_{p+1}, \ldots, x_K \mid \sigma, a_i) = \frac{1}{(2\pi\sigma^2)^{(K-p)/2}}\exp\Big\{-\frac{1}{2\sigma^2}\sum_{t=p+1}^{K}\Big(x_t + \sum_{k=1}^{p} a_k x_{t-k}\Big)^2\Big\} = \frac{1}{(2\pi\sigma^2)^{(K-p)/2}}\exp\Big\{-\frac{1}{2\sigma^2}\,\alpha^*(x_1, \ldots, x_K; a)\Big\}. \tag{34}$$

The squared prediction residual for Kay's density, $\alpha^*(x_1, \ldots, x_K; a)$, can be shown to be
identical to Juang's, under the assumption of $K \gg p$, and the fact that $a_0 = 1$, and
$x_t = 0$ for $t < 0$, $t > T$:

$$\alpha^*(x_1, \ldots, x_K; a) = \sum_{t=p+1}^{K}\Big(x_t + \sum_{i=1}^{p} a_i x_{t-i}\Big)^2 = r_a(0)r_x(0) + 2\sum_{i=1}^{p} r_a(i)r_x(i) = \alpha(x_1, \ldots, x_K; a).$$

i=l 


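The original method by Poritz [92] noted another, simpler expression for the prediction
error, realized through a matrix product,

$$\alpha(x_1, \ldots, x_K; a) = \begin{bmatrix}1 & a_1 & a_2 & \cdots & a_p\end{bmatrix}\begin{bmatrix} r_x(0) & r_x(1) & \cdots & r_x(p)\\ r_x(-1) & r_x(0) & \cdots & r_x(p-1)\\ \vdots & \vdots & \ddots & \vdots\\ r_x(-p) & r_x(-p+1) & \cdots & r_x(0)\end{bmatrix}\begin{bmatrix}1 \\ a_1 \\ \vdots \\ a_p\end{bmatrix} = a^T R_x a. \tag{35}$$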


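Thus, the three methods contained in Equations 33, 34 and 35 provide a way to
evaluate the output density of the current frame with respect to the state filter coefficients.
For reestimation, the critical points of the auxiliary function with respect to the filter
coefficients and residual energy are examined, now using the frame-based density function
given in Equation 31. The results are expressible in terms of the autocorrelation coefficients
of the frame. Using Juang's notation of the autocorrelation function of the $t$-th frame
having length $K$,

$$r_t(j) = \sum_{k=1}^{K-j} O_{t,k}O_{t,k+j},$$

it can be shown [57, 60] that the MLE predictor coefficients can be solved through the normal
equations, Equation 23,

$$-\sum_{t=1}^{T-1}\gamma_t(i)\begin{bmatrix} r_t(1)\\ r_t(2)\\ \vdots\\ r_t(p)\end{bmatrix} = \sum_{t=1}^{T-1}\gamma_t(i)\begin{bmatrix} r_t(0) & r_t(-1) & \cdots & r_t(-p+1)\\ r_t(1) & r_t(0) & \cdots & r_t(-p+2)\\ \vdots & \vdots & \ddots & \vdots\\ r_t(p-1) & r_t(p-2) & \cdots & r_t(0)\end{bmatrix}\begin{bmatrix} a_i(1)\\ a_i(2)\\ \vdots\\ a_i(p)\end{bmatrix}.$$

Dividing both sides by $\sum_t \gamma_t(i)$ expresses these equations in terms of $\bar r_i(j)$, the average
state autocorrelation function,

$$\bar r_i(j) = \frac{\sum_{t=1}^{T-1}\gamma_t(i)\,r_t(j)}{\sum_{t=1}^{T-1}\gamma_t(i)}. \tag{36}$$

Denote the linear equations as

$$-\sum_{t}\gamma_t(i)\,r_t = \sum_{t}\gamma_t(i)\,R_t\,a_i$$

or simply

$$-\bar r_i = \bar R_i a_i. \tag{37}$$

Similarly, the noise variance estimate uses the maximum likelihood $\hat a_i$, which solves the
equation

$$\hat\sigma_i^2 = \frac{\sum_{t=1}^{T-1}\gamma_t(i)\,\hat a_i^T R_t\, \hat a_i}{K\sum_{t=1}^{T-1}\gamma_t(i)},$$

where $\hat a_i$ is augmented with a leading one and $R_t$ is the frame autocorrelation matrix, as in Equation 35.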


In  summary,  this  section  demonstrated  the  procedure  when  hidden  state  changes  occur 
at  frame  boundaries  and  hidden  filters  represent  a  linear  dynamic  system  describing  the 
entire  frame. 
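The frame-level update summarized above can be sketched for one state as follows. The
fragment is a minimal illustration under assumptions: helper names are hypothetical, the
frame likelihoods `gamma_i` are taken as given, all frames share the same length $K$, and the
Levinson-Durbin recursion is used to solve the averaged normal equations of Equation 37.

```python
def frame_autocorr(frame, p):
    """Unnormalized frame autocorrelation r_t(j) = sum_k x_k x_{k+j}, j = 0..p."""
    K = len(frame)
    return [sum(frame[k] * frame[k + j] for k in range(K - j)) for j in range(p + 1)]


def levinson(r, p):
    """Solve the Toeplitz normal equations; returns [1, a_1, ..., a_p]."""
    a = [1.0]
    err = r[0]
    for m in range(1, p + 1):
        acc = sum(a[k] * r[m - k] for k in range(m))
        refl = -acc / err
        prev = a + [0.0]
        a = [prev[k] + refl * prev[m - k] for k in range(m + 1)]
        err *= 1.0 - refl * refl
    return a


def quad_form(a, r):
    """a^T R a for the Toeplitz matrix R built from r (the form of Equation 35)."""
    n = len(a)
    return sum(a[m] * a[k] * r[abs(m - k)] for m in range(n) for k in range(n))


def reestimate_frame_state(frames, gamma_i, p):
    """Average state autocorrelation (Eq. 36), filter (Eq. 37) and noise variance."""
    K = len(frames[0])
    total = sum(gamma_i)
    r_t = [frame_autocorr(f, p) for f in frames]
    r_bar = [sum(g * r[j] for g, r in zip(gamma_i, r_t)) / total for j in range(p + 1)]
    a = levinson(r_bar, p)
    var = sum(g * quad_form(a, r) for g, r in zip(gamma_i, r_t)) / (K * total)
    return a, var, r_bar


frames = [[1.0, 0.5, -0.2, 0.3, -0.1, 0.4], [0.2, -0.7, 0.6, -0.5, 0.9, -0.3]]
a_hat, var_hat, r_bar = reestimate_frame_state(frames, [0.75, 0.25], 2)
```

Because the filter solves the averaged normal equations, the weighted quadratic form reduces
to $\bar r_i(0) + \sum_k \hat a_i(k)\,\bar r_i(k)$, which offers a simple consistency check on an
implementation.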


3.4.1 Initialization By Clustering. Since all of the reestimation schemes for
HMMs are iterative and without theoretical convergence to global extrema, the need
for good initial models exists. Often, a uniform segmentation process is used to cluster data
into the number of HMM states; these cluster centroids are then mapped to probabilistic
distributions. Depending on the feature representation, some expectation is used within
this uniform segmentation process. It is demonstrated that for autoregressive features,
also known as linear predictive coding (LPC), the sample mean produces non-optimal
initialization when using an appropriate distortion optimality criterion.


3.4.1.1 Spectral Distortion Measures. Each frame of data can be represented
by autoregressive filter coefficients, the autocorrelation function or by some other
spectral feature, such as Mel frequency cepstral coefficients. In order to measure "closeness"
amongst frames of data, a suitable distortion must be defined. Distortions in spectral
shape or overall spectrum can make use of mathematical metrics. For example, the $L_2$
metric between two log spectra results in

$$d_2^2(S_1, S_2) = \|S_1(\omega), S_2(\omega)\|_2^2 = \int_{-\pi}^{\pi} |\log S_1(\omega) - \log S_2(\omega)|^2\, \frac{d\omega}{2\pi}.$$

Applying this metric to two unity gain LPC spectra¹ results in the Itakura measure [95].
Using the density of a linear prediction coefficient, the definition of the Itakura-Saito
distance [27] is a form of the Mahalanobis distance, defined as

$$d_{IS}(a_1, a_2) = \frac{(a_2 - a_1)^T R_{a_1}(a_2 - a_1)}{a_1^T R_{a_1} a_1}. \tag{38}$$


¹The unity gain LPC spectrum is denoted as $S(\omega) = 1/|A(e^{j\omega})|^2$.

When clustering cepstral coefficients for the initial state model, it turns out that the $L_2$
norm on the log spectra results in the typical Euclidean norm of the cepstral coefficients.
Thus, the sample mean is the optimal cluster center,

$$d_2^2(S_1, S_2) = \int_{-\pi}^{\pi} |\log S_1(\omega) - \log S_2(\omega)|^2\,\frac{d\omega}{2\pi} = \sum_{n=-\infty}^{\infty} |c_{1,n} - c_{2,n}|^2$$

where a finite approximation using $L$ coefficients is often used,

$$d_2^2(S_1, S_2) \approx \sum_{n=0}^{L} |c_{1,n} - c_{2,n}|^2.$$

This  will  be  true  even  if  perceptual  weightings  are  performed,  such  as  a  Mel  frequency 
analysis. 
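The claim that the sample mean is the optimal cluster center under the truncated cepstral
distance can be checked numerically with a short sketch. The vectors below are made-up
illustrative values, not cepstra from any database, and the function names are hypothetical.

```python
def cepstral_distance_sq(c1, c2):
    """Truncated squared L2 distance between two cepstral vectors."""
    return sum((x - y) ** 2 for x, y in zip(c1, c2))


def centroid(vectors):
    """Sample mean of a cluster of equal-length cepstral vectors."""
    n = len(vectors)
    return [sum(v[d] for v in vectors) / n for d in range(len(vectors[0]))]


def total_distortion(center, vectors):
    """Sum of squared distances from a candidate center to every cluster member."""
    return sum(cepstral_distance_sq(center, v) for v in vectors)


# Illustrative (made-up) cepstral vectors forming one cluster.
cluster = [[1.0, 0.2, -0.1], [0.8, 0.4, 0.1], [1.2, 0.0, -0.3]]
center = centroid(cluster)
shifted = [c + 0.05 for c in center]   # any other candidate center
```

Moving the center away from the sample mean in any direction increases the total distortion,
which is why ordinary averaging is appropriate for cepstral features but, as shown next, not
for LPC vectors.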


3.4.1.2 Uniform Segmentation and Clustering. In order to create an initial
model for the reestimation process, it will be necessary to cluster the features into initial
"states." Assume the frames are represented by autoregressive coefficients. Then, define
the sample expectation over $L$ autoregressive or LPC vectors as

$$\mu_{IS} = E[a_1, a_2, \ldots, a_L] = \arg\min_{a}\sum_{i=1}^{L} d_{IS}(a_i, a)$$

which is simply the LPC representation with minimum Itakura-Saito distortion to all $L$
LPC vectors. Without consideration of the feature representation, the arithmetic mean is
often used [95, 134], denoted by $\mu_A$,

$$\mu_A = \frac{1}{L}\sum_{i=1}^{L} a_i.$$

Solving for the minimum that defines $\mu_{IS}$ and using Equation 38, one seeks the $a$ which solves the
necessary optimality condition

$$\frac{\partial}{\partial a}\sum_{i=1}^{L} d_{IS}(a_i, a) = 0$$

which occurs, for gain-normalized frames, as the solution of

$$\sum_{i=1}^{L} R_i a = -\sum_{i=1}^{L} r_i \tag{39}$$

where $r_i$ denotes the autocorrelation function for frame $i$ and $R_i$ denotes the corresponding
matrix.

3.4.1.3 Relationship to AR HMMs. During the reestimation for new state
filter coefficients, the Baum-Welch procedure applied to the frame AR HMM problem
resulted in Equation 36. This described the new state autocorrelation function, interpreted
as a weighted sum of individual frame autocorrelations, using the weight $\gamma_t(i)$. The
maximum likelihood state sequence would find those frames belonging to state $i$. This would
result in the new estimate for the $i$-th state autocorrelation function as

$$\bar r_i(j) = \frac{1}{T_i}\sum_{t \in \text{state } i} r_t(j),$$

with $T_i$ the number of frames assigned to state $i$,
or the sample mean of the frame autocorrelation functions. Thus, maximizing the Baum
auxiliary function for a frame autoregressive hidden Markov model with respect to the
state filters, Equations 36 and 37, results in minimization of the Itakura-Saito distortion
across those frames.
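The difference between the two cluster centers can be made concrete with a small numerical
sketch. It assumes, as in Equation 39, that the Itakura-Saito center is obtained by solving
the normal equations on the summed autocorrelations of gain-normalized frames; the helper
names and the two first-order example frames are hypothetical.

```python
def levinson(r, p):
    """Levinson-Durbin recursion; returns ([1, a_1, ..., a_p], residual energy)."""
    a = [1.0]
    err = r[0]
    for m in range(1, p + 1):
        acc = sum(a[k] * r[m - k] for k in range(m))
        refl = -acc / err
        prev = a + [0.0]
        a = [prev[k] + refl * prev[m - k] for k in range(m + 1)]
        err *= 1.0 - refl * refl
    return a, err


def gain_normalize(r, p):
    """Scale r so that the frame's own prediction residual equals one."""
    _, err = levinson(r, p)
    return [value / err for value in r]


def is_centroid(frame_autocorrs, p):
    """Center from Eq. 39: normal equations on the summed frame autocorrelations."""
    summed = [sum(r[j] for r in frame_autocorrs) for j in range(p + 1)]
    return levinson(summed, p)[0]


def arithmetic_mean(frame_autocorrs, p):
    """Plain average of the individual LPC vectors of each frame."""
    lpcs = [levinson(r, p)[0] for r in frame_autocorrs]
    return [sum(a[k] for a in lpcs) / len(lpcs) for k in range(p + 1)]


# Two illustrative first-order frames with lag-one correlations 0.9 and 0.2.
frames_r = [gain_normalize(r, 1) for r in ([1.0, 0.9], [1.0, 0.2])]
centroid = is_centroid(frames_r, 1)
mean = arithmetic_mean(frames_r, 1)
```

For these frames the arithmetic mean of the predictor coefficients is -0.55, while the center
obtained from the averaged autocorrelations is pulled toward the more predictable frame,
since gain normalization gives that frame the larger autocorrelation.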

3.4.2 Proof of Concept Trial. Poritz applied this frame-based hidden filter
reestimation using a 5-state ergodic architecture with simple third order filters. The results
are shown for a female speaker of the YOHO database in Figure 11, which demonstrates
the inherent language modeling by this method. The five states naturally form five
phonetically-similar broad classes [92, 72]. These include silence (S), vowels (V), nasals
(N), liquid-glides (L), and consonants (C) as evidenced by the spectral characteristics.

Another  contribution  of  this  research  is  the  extension  of  this  technique  to  model 
the  sample  correlations  within  each  phoneme  separately,  shown  in  Figure  12.  In  order 
to  extract  temporal  information  within  a  phoneme,  a  3-state  left-to-right  model  for  each 
phone  has  been  created,  each  with  a  more  appropriate  12-th  order  predictor.  Not  only 
does  this  architecture  better  model  the  overall  spectrum  of  each  phoneme,  but  the  3-state 
left-to-right  architecture  models  the  transitions  within  a  phoneme.  Another  useful  result 
of this method is the ability to provide state-of-the-art speech recognition based on these
types of sub-word models.

Figure 11. Poritz Method Applied to a YOHO Speaker, Showing Language Broad Class
Modeling. Right: The architecture provides an ergodic 5-state hidden filter
Markov model. Note: not all transitions are shown, for clarity. Left: The five
filters each attempt to model a broad phonetic category. These include silence (S),
vowels (V), nasals (N), liquid-glides (L), and consonants (C) as evidenced by
the power spectral densities of the resulting filter estimates.


3.5  Vector  Hidden  Filter  Markov  Model  Reestimation 

Thus far, the reestimation of hidden filters operating on samples has been developed.
Options have included samples with and without a bias, autoregressive, autoregressive
moving average and frame based techniques. The third and final level where hidden filter
Markov models may prove extremely useful is the feature space. This level of modeling
first attempts to optimize the feature extraction, where relatively small-sized vectors are
analyzed at efficient rates. Then, the dynamics within each state, assumed generated
by a vector autoregressive process, are estimated. Begin with the definition of a vector
autoregressive hidden Markov model. For each state $i$,

$$O_t = -\sum_{j=1}^{p} A_{ij}O_{t-j} + W_t$$

where the last expression $W_t$ can be a non-zero mean multivariate white Gaussian noise
source and the predictor matrices given for state $i$ are denoted by $A_{ij}$. This equation is
expressible with a zero mean Gaussian input, $E_t$, as

$$O_t = \mu_i - \sum_{j=1}^{p} A_{ij}O_{t-j} + E_t \tag{40}$$

where we seek to estimate the $A_{ij}$ matrices and the state mean, $\hat\mu_i$. First, the relation to
the standard multivariate linear prediction is established.

Figure 12. Extended Poritz Method for Temporal Phoneme Modeling of YOHO Speaker.
Each speaker is represented by 21 3-state left-to-right monophone models.
Shown for one speaker, the power spectrum of the resulting filter estimates
for all models and all states.

3.5.1 Multivariate LPC Approach. For zero mean multivariate noise, Kay [63]
analyzes the multidimensional spectral estimation of vector Linear Predictive Coding (LPC)
processes. The solution is a matrix equivalent of the Yule-Walker equations, using the
biased autocorrelation function estimator,

$$-\begin{bmatrix} R_x(1)\\ R_x(2)\\ \vdots\\ R_x(p)\end{bmatrix} = \begin{bmatrix} R_x(0) & R_x(-1) & \cdots & R_x(-(p-1))\\ R_x(1) & R_x(0) & \cdots & R_x(-(p-2))\\ \vdots & \vdots & \ddots & \vdots\\ R_x(p-1) & R_x(p-2) & \cdots & R_x(0)\end{bmatrix}\begin{bmatrix} A_1\\ A_2\\ \vdots\\ A_p\end{bmatrix} \tag{41}$$

where each $R_x(j)$ is a $(d \times d)$ matrix corresponding to lag $j$ of the vector process.


For the non-zero mean case, there exists the lesser-known relation of the covariance
function satisfying the Yule-Walker equations [20, 21]. Let the estimated matrix covariance
function be substituted for the autocorrelation function in Equation 41,

$$C_x(j) = R_x(j) - \hat\mu\hat\mu^T. \tag{42}$$

Equation 42 will be shown identical to the technique of maximizing the Baum auxiliary
function $Q(\lambda, \bar\lambda)$ with respect to the vector and matrix quantities for each state.

3.5.2  Special  Cases.  Four  cases  can  be  developed  based  on  vector  autoregressive 
modeling: 

• Diagonal $A_i$, Diagonal $\Sigma$: Each current observation dimension $d$ is separately
regressed on past observations of the same dimension. The current observation has
independent (uncorrelated) dimensions, as expressed by its covariance.

$$O_t^d = \mu^d - \sum_i A_i(d,d)\,O_{t-i}^d + E_t^d, \qquad \Sigma(j,k) = 0,\ A_i(j,k) = 0,\ j \ne k$$

• Full $A_i$, Diagonal $\Sigma$: Each dimension, in turn, is regressed on past observations of all
dimensions. The current observation still has independent (uncorrelated) dimensions.

$$O_t = \mu - \sum_i A_iO_{t-i} + E_t, \qquad \Sigma(j,k) = 0,\ j \ne k$$

• Diagonal $A_i$, Full $\Sigma$: Each current observation dimension $d$ is separately regressed
on past observations of the same dimension. The current observation has full covariance
and dimensions may be correlated.

$$O_t^d = \mu^d - \sum_i A_i(d,d)\,O_{t-i}^d + E_t^d, \qquad A_i(j,k) = 0,\ j \ne k$$

• Full $A_i$, Full $\Sigma$: Each dimension, in turn, is regressed on past observations of all
dimensions. The current observation has full covariance, so dimensions may be correlated.

$$O_t = \mu - \sum_i A_iO_{t-i} + E_t$$

3.5.3 Full Predictor, Full Covariance Reestimation. Without any a priori
information concerning the vector process, it would be safe to apply the full predictor with full
covariance equations to the problem. The solution for the new estimates begins by taking
the partial derivative of the Baum auxiliary function with respect to each state's predictor
and covariance matrix. Using simplified notation [66], the logarithm of the output density
of the multivariate model in Equation 40 is given by

$$\log b_i(O_t) = C - \tfrac{1}{2}\log|\Sigma_i| - \tfrac{1}{2}\Big(O_t - \mu_i + \sum_{j=1}^{p}A_{ij}O_{t-j}\Big)^T\Sigma_i^{-1}\Big(O_t - \mu_i + \sum_{j=1}^{p}A_{ij}O_{t-j}\Big). \tag{43}$$

Make the following matrix substitutions, with $Y_t = O_t$ denoting the current observation,

$$B_i = \begin{bmatrix}A_{i1} & A_{i2} & \cdots & A_{ip}\end{bmatrix}, \qquad X_t = \begin{bmatrix}O_{t-1}^T & O_{t-2}^T & \cdots & O_{t-p}^T\end{bmatrix}^T.$$

This allows certain summations to appear as matrix multiplications. The matrix equations
which satisfy the critical point of the Baum auxiliary function, Equation 18, using the
density function of Equation 43, become

$$\frac{\partial Q_b}{\partial B_i} = -\Sigma_i^{-1}\sum_{t=1}^{T-1}\gamma_t(i)\,(Y_t - \mu_i + B_iX_t)X_t^T = 0 \;\Longrightarrow\; \sum_{t=1}^{T-1}\gamma_t(i)\,(Y_tX_t^T - \mu_iX_t^T + B_iX_tX_t^T) = 0. \tag{44}$$

Applying the gradient for the state mean provides

$$\frac{\partial Q_b}{\partial \mu_i} = \Sigma_i^{-1}\sum_{t=1}^{T-1}\gamma_t(i)\,[Y_t - \mu_i + B_iX_t] = 0 \;\Longrightarrow\; \sum_{t=1}^{T-1}\gamma_t(i)\,[Y_t - \mu_i + B_iX_t] = 0. \tag{45}$$

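Making the following matrix substitutions,

$$S_y = \sum_t\gamma_t(i)Y_t, \qquad S_x = \sum_t\gamma_t(i)X_t,$$

$$S_{yx} = \sum_t\gamma_t(i)Y_tX_t^T, \qquad S_{yy} = \sum_t\gamma_t(i)Y_tY_t^T,$$

$$S_{xx} = \sum_t\gamma_t(i)X_tX_t^T, \qquad N = \sum_t\gamma_t(i),$$

then dropping state notation, Equations 44 and 45 simply become the following,

$$S_{yx} = \mu S_x^T - BS_{xx}, \qquad S_y = N\mu - BS_x. \tag{46}$$

Lastly, the covariance of the noise source $\Sigma$ is estimated by

$$\hat\Sigma_i = \sum_{t=1}^{T-1}\gamma_t(i)\,(Y_tY_t^T + BX_tY_t^T + Y_tX_t^TB^T + BX_tX_t^TB^T - \mu\mu^T)\Big/\sum_{t=1}^{T-1}\gamma_t(i)$$

and dropping state notation

$$\hat\Sigma = \frac{1}{N}\big[S_{yy} + BS_{yx}^T + S_{yx}B^T + BS_{xx}B^T - N\mu\mu^T\big]. \tag{47}$$

The solution of Equations 46, the joint vector-matrix simultaneous equations, is

$$S_yS_x^T - NS_{yx} = B(NS_{xx} - S_xS_x^T)$$

$$B = (S_yS_x^T - NS_{yx})(NS_{xx} - S_xS_x^T)^{-1} \tag{48}$$

and

$$\hat\mu = \frac{1}{N}(S_y + BS_x). \tag{49}$$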
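Equations 48 and 49 can be exercised with a small pure-Python sketch. It is a hedged
illustration rather than the implementation used here: the matrix helpers and function names
are hypothetical, the state likelihoods are taken as given, the sums start at the first
sample with a complete history, and the test sequence is a made-up noise-free
two-dimensional first-order process so that the exact parameters should be recovered.

```python
def mat_mul(A, B):
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*B)] for row in A]


def outer(u, v):
    return [[x * y for y in v] for x in u]


def mat_add(A, B, scale=1.0):
    """Return A + scale * B."""
    return [[a + scale * b for a, b in zip(ra, rb)] for ra, rb in zip(A, B)]


def mat_inv(M):
    """Gauss-Jordan inverse with partial pivoting."""
    n = len(M)
    aug = [list(M[i]) + [float(i == j) for j in range(n)] for i in range(n)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        scale = aug[col][col]
        aug[col] = [x / scale for x in aug[col]]
        for row in range(n):
            if row != col:
                factor = aug[row][col]
                aug[row] = [x - factor * y for x, y in zip(aug[row], aug[col])]
    return [row[n:] for row in aug]


def reestimate_vector_ar(obs, gamma_i, p):
    """Predictor B (Eq. 48) and mean (Eq. 49) for one state; obs[t] is a length-d list."""
    d = len(obs[0])
    N = 0.0
    S_y = [0.0] * d
    S_x = [0.0] * (d * p)
    S_yx = [[0.0] * (d * p) for _ in range(d)]
    S_xx = [[0.0] * (d * p) for _ in range(d * p)]
    for t in range(p, len(obs)):
        g, y = gamma_i[t], obs[t]
        x = [v for j in range(1, p + 1) for v in obs[t - j]]   # stacked past vectors
        N += g
        S_y = [s + g * v for s, v in zip(S_y, y)]
        S_x = [s + g * v for s, v in zip(S_x, x)]
        S_yx = mat_add(S_yx, outer(y, x), g)
        S_xx = mat_add(S_xx, outer(x, x), g)
    left = mat_add(outer(S_y, S_x), S_yx, -N)                  # S_y S_x^T - N S_yx
    right = mat_add([[N * v for v in row] for row in S_xx], outer(S_x, S_x), -1.0)
    B = mat_mul(left, mat_inv(right))
    B_Sx = [sum(b * s for b, s in zip(row, S_x)) for row in B]
    mu = [(sy + bs) / N for sy, bs in zip(S_y, B_Sx)]
    return B, mu


# Noise-free test process O_t = mu - A O_{t-1} (illustrative values only).
A_true = [[0.5, 0.2], [0.0, -0.4]]
mu_true = [1.0, 2.0]
obs = [[3.0, -1.0]]
for _ in range(14):
    prev = obs[-1]
    obs.append([mu_true[r] - sum(A_true[r][c] * prev[c] for c in range(2))
                for r in range(2)])
B_hat, mu_hat = reestimate_vector_ar(obs, [1.0] * len(obs), 1)
```

The matrix $NS_{xx} - S_xS_x^T$ is a scaled covariance of the stacked past observations, which
is the sense in which the covariance function replaces the autocorrelation function in the
non-zero mean case.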

3.5.4 Diagonal Predictor, Diagonal Covariance Reestimation. The choice of
speech spectral features, Mel frequency cepstral coefficients, supports a diagonal predictor structure.
This decision is based on the Discrete Cosine Transform decorrelating the elements within
each vector. Thus, one seeks the best diagonal $A_{ij}$ matrices, such that $B_i$ takes the form

$$B_i = \begin{bmatrix}A_{i1} & A_{i2} & \cdots & A_{ip}\end{bmatrix}$$

and each submatrix $A_{ij}$ now resembles

$$A_{ij} = \begin{bmatrix}a_{ij1} & 0 & \cdots & 0\\ 0 & a_{ij2} & \cdots & 0\\ \vdots & \vdots & \ddots & \vdots\\ 0 & 0 & \cdots & a_{ijd}\end{bmatrix}.$$

We observe that each dimension of Equation 48 can be solved separately using least squares.
This occurs since there are still $d \cdot p$ linear equations (for each dimension) but only $p$
unknowns. Solving for the filter coefficients reduces to the familiar Yule-Walker equations,
with the covariance quantities substituted for the autocorrelation ones.

3.5.5 Numerical Stability. Note that in Equation 49, any imprecision in the current
filter affects both the new covariance and mean estimates. For this reason, a similar model
which uses the same mean estimate vector as a standard HMM, namely the a posteriori
mean or probabilistically weighted mean, is proposed. Using the vector autoregressive
model

$$O_t = \mu_i^f - \sum_{j=1}^{p}A_{ij}O_{t-j} + E_t$$

where the original $\mu_i^f$ has been identified as dependent on the filter, let the new
observations $(\tilde O_t)$ be reduced by the current state mean estimate, $\tilde O_t = O_t - \hat\mu_i$. Note that $\mu_i^f$ is
not the a posteriori mean of a standard HMM, which shall be denoted by $\hat\mu_i$,

$$\mu_i^f = \hat\mu_i + \sum_{j=1}^{p}A_{ij}\hat\mu_i.$$

The vector process is zero mean within each state or mixture and the model becomes

$$\tilde O_t = -\sum_{j=1}^{p}A_{ij}\tilde O_{t-j} + E_t$$

with the estimation of the $A_{ij}$ matrices and the a posteriori state mean, $\hat\mu_i$, proceeding
according to the standard hidden Markov model mean update. Naturally, in both cases, when
$p = 0$ or $A_{ij} = 0$, the reestimation reduces to the standard Gaussian HMM. Also
note that in the single state $N = 1$ case, the reestimation is the direct multivariate LPC
model [63] from Section 3.5.1.

3.5.6 Proof of Concept Trial. To demonstrate the ability of this model to extract
low-pass, high-pass and bandpass filters across different dimensions from a vector Markov
state source, ten sequences of 200 observations (2 dimensional) were created using the
following multivariate filters. The left-to-right transition matrix has $a_{11} = 0.99$.


1.00 

0.00  -1.20  0.00 

0.00 

Bi  =  \ 

[7|Aii|Ai2]  — 

1 

1  0.429 

0.00 

1.00  0.00
1

1.00  0.00  1.124 

0.00 

0.39 

0.00 

B2  —  [/IA21IA22]  —
1

0.00  1.00  0.00 

-1.237  0.00 

0.775
0.00

Hi  =
=

,  S2  — 

4.0 

0.00  0.02 

-3.0 

0.00 

0.05 

For  the  diagonal  model,  the  estimated  output  density  parameters  were  as  follows: 


1.00 

0.00 

-0.878 

0.00 

0.00 

Bi 

—  [-^1^11  |-4i2]  — 

0.00 

1.00 

i 

0.00 

0.176 

1  0.084 
0.00 

0.819 

=  [-f|A2l|A22]  = 

1.00 

0.00 

1.136 

0.00 

0.00 

B2 

0.00 

1.00 

0.00 

-1.224 

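1  0.398

$$\hat\mu_1 = \begin{bmatrix}1.80\\ 3.93\end{bmatrix},\quad \hat\Sigma_1 = \begin{bmatrix}0.131 & 0.00\\ 0.00 & 0.059\end{bmatrix},\quad \hat\mu_2 = \begin{bmatrix}0.99\\ -3.04\end{bmatrix},\quad \hat\Sigma_2 = \begin{bmatrix}0.10 & 0.00\\ 0.00 & 0.05\end{bmatrix}$$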

The power spectrum of the generator for each state and each dimension is shown in Figure
13. When using a known feature, such as cepstral coefficients, the predictor should be
diagonal. For this test signal, both full and diagonal predictor types were applied and the
spectra of the estimated filters are shown in Figure 14.

Figure 13. Markov-Modulated Vector AR(2) process.

3.6  Conclusion 

The methods developed in this chapter now allow for modeling Markov-modulated
linear dynamic systems at the sample level, the frame level and the processed features level.
Key examples have shown their ability to find ergodic AR filters, ergodic ARMA models,
phoneme-based frame level left-to-right AR filters and vector autoregressive models.

Figure 14. Estimated Markov-modulated vector AR(2) spectrum using both diagonal
and full predictors.

In summary, the Baum-Welch reestimation procedure follows a prescribed sequence:

• Initial model $\lambda_0$, often solved with some clustering or segmental k-means procedure
[Sec 3.4.1]

• Forward-Backward algorithm solves for $\gamma_t(i)$ [Sec 3.2.1 and 3.2.2]

• Solve ML estimates of $\Pi$, $A$ using standard hidden Markov model procedures [96]

• Solve simultaneous equations for the ML $B$ output density parameters including:

- Sample AR: $a_{ik}$, $\sigma_i^2$ and/or $\mu_i$ [Sec 3.3.2, 3.3.3]

- Sample ARMA: $a_i$, $b_i$, $\sigma_i^2$ [Sec 3.3.5]

- Frame: $a_i$, $\sigma_i^2$ [Sec 3.4]

- Vector: $B_i = [A_{i1} \cdots A_{ip}]$, $\Sigma_i$, $\mu_i$ [Sec 3.5]

•  Repeat  until  convergence. 
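The forward-backward step of the sequence above, which supplies the weights $\gamma_t(i)$ used by
every reestimation formula in this chapter, can be sketched as follows. The scaled recursion
shown is the standard textbook form and not code from this research; the function name and
the toy densities are assumptions, and the output densities `b[t][i]` would in practice be
evaluated from the hidden filter of each state.

```python
def state_posteriors(pi, A, b):
    """Scaled forward-backward; returns gamma[t][i] = P(state i at time t | all data).

    pi[i]: initial state probabilities, A[i][j]: transition probabilities,
    b[t][i]: output density of observation t under state i.
    """
    T, N = len(b), len(pi)
    alpha = [[0.0] * N for _ in range(T)]
    beta = [[1.0] * N for _ in range(T)]
    scale = [0.0] * T
    for t in range(T):
        for j in range(N):
            prior = pi[j] if t == 0 else sum(alpha[t - 1][i] * A[i][j] for i in range(N))
            alpha[t][j] = prior * b[t][j]
        scale[t] = sum(alpha[t])
        alpha[t] = [v / scale[t] for v in alpha[t]]        # keep alpha normalized
    for t in range(T - 2, -1, -1):
        for i in range(N):
            beta[t][i] = sum(A[i][j] * b[t + 1][j] * beta[t + 1][j]
                             for j in range(N)) / scale[t + 1]
    gamma = []
    for t in range(T):
        row = [alpha[t][i] * beta[t][i] for i in range(N)]
        total = sum(row)
        gamma.append([v / total for v in row])
    return gamma


# Toy two-state example: the first observations favor state 0, the last favors state 1.
gamma = state_posteriors(
    [0.5, 0.5],
    [[0.9, 0.1], [0.1, 0.9]],
    [[0.9, 0.1], [0.8, 0.2], [0.1, 0.9]])
```

Each row of `gamma` sums to one, which is the property used earlier when the state-pooled
noise variance was normalized.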

Naturally, each can be extended to multiple mixtures, with state mixture weighting similar
to the standard hidden Markov model approach.

Before exploring their effectiveness experimentally, several theoretic properties
concerning the a priori classification and convergence of the Baum-Welch learning must be
resolved. These are proven in Chapter IV. Then in Chapter V, particular versions of these
models will be applied to the challenging problem of large population speaker identification
and recognition.

IV. Hidden Filter Analysis


4.1 Introduction

Several key properties of hidden filter Markov models will be demonstrated analytically
and experimentally in this chapter. The first justifies their use over other methods
for pattern classification problems. Based on a theorem by Fielding [42], it will be shown
that an assumed hidden filter Markov source reduces the joint entropy relative to an assumed
Gaussian mixture Markov model. Secondly, it will be shown that an equivalent single mixture
structure can be constructed for any finite mixture. If all output densities
have the property of negative log concavity for this equivalence model, then each step of
the Baum-Welch algorithm will find the global maximum for that iteration, and the
overall convergence will be monotonic. Lastly, the hidden filter output densities will also
be shown to have the property of negative log concavity.

4.2  Entropy  Analysis  of  Markov  Sources 

Fielding [42] recently provided a relation between information theory and pattern
classification. Entropy, the average measure of information over a set of observations,
provides a useful tool for comparing classifiers. A classification system which reduces
uncertainty in a set of observations by using useful assumptions about the source model will
reduce the probability of error [72]. It is therefore desirable to find models which reduce
the joint entropy $H(O_1, O_2, \ldots, O_n)$, defined over a set of observations as

$$H(O_1, O_2, \ldots, O_n) = -\sum_{N} p(O_1, O_2, \ldots, O_n) \log p(O_1, O_2, \ldots, O_n).$$

The summation over $N$ accounts for all possible orderings of the sequence of observations.
Two key facts concern the joint entropy properties of sequences. The first, attributed to
Blahut [12], is

$$H(O_1, O_2, \ldots, O_n) \le \sum_{i=1}^{n} H(O_i)$$

with equality holding if the random variables are independent. Thus, the entropy of a
sequence will always be less than or equal to the sum of the entropies of the individual
observations. The second fact provides insight into Markov processes. Let the observation
sequence be a p-th order Markov process. Then,

$$H_p(O_1, O_2, \ldots, O_n) \le \sum_{i=1}^{n} H(O_i)$$

where $H_p(O_1, O_2, \ldots, O_n)$ denotes the entropy of a p-th order Markov process and equality
holds for independence. Fielding's final results demonstrate [42]

$$H_p(O_1, O_2, \ldots, O_n) \le H_1(O_1, O_2, \ldots, O_n) \le H(O_1, O_2, \ldots, O_n) \le \sum_{i=1}^{n} H(O_i) \qquad (50)$$

where $H_1(O_1, O_2, \ldots, O_n)$ is the entropy of a first order Markov process. An increasing
Markov dependency in the sequence results in a decreasing joint entropy. A pattern
recognizer which models this dependency should have better classification. While Fielding
chose the hidden Markov model as the source model, this dissertation examines hidden
filter Markov models. It will be shown that the observations produced by a hidden Markov
model are not a Markov process. However, if the assumed source is a p-th order hidden
filter Markov model, then the observations will be a p-th order Markov process and
Fielding's theorem applies directly.
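The effect of dependency on joint entropy can be checked numerically on a small case. The sketch below is illustrative only: it uses a directly observed two-state first-order Markov chain (not a hidden model), with a transition matrix and sequence length chosen here as assumptions, and compares the exact joint entropy, found by enumerating every sequence, against the sum of the marginal entropies.

```python
import math
from itertools import product

def entropy(probs):
    """Shannon entropy in bits of an iterable of probabilities."""
    return -sum(p * math.log2(p) for p in probs if p > 0.0)

def markov_joint_entropy(pi, A, n):
    """Joint entropy of O_1..O_n for a first-order chain, by enumeration."""
    states = range(len(pi))
    probs = []
    for seq in product(states, repeat=n):
        p = pi[seq[0]]
        for prev, cur in zip(seq, seq[1:]):
            p *= A[prev][cur]
        probs.append(p)
    return entropy(probs)

pi = [0.5, 0.5]                   # stationary distribution of the chain below
A = [[0.9, 0.1], [0.1, 0.9]]      # strong first-order dependence (assumed values)
n = 4
h_markov = markov_joint_entropy(pi, A, n)
h_independent = n * entropy(pi)   # sum of the marginal entropies H(O_i)
```

Because the chain starts in its stationary distribution, every marginal is uniform and the independent bound is 4 bits, while the dependent sequence carries noticeably less joint entropy.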

Lemma IV.1 A hidden Markov model $\lambda$ generates observations which are not Markov,
but independent. Hence,

$$p(O_t \mid O_{t-1}, \ldots, O_1) = p(O_t).$$

Proof: Using the hidden Markov model standard assumptions (Section 2.2.1), the
conditional likelihood is shown to be unconditioned on any past observations:

$$p(O_t \mid O_{t-1}, \ldots, O_1) = \sum_{q_t} p(O_t \mid q_t, O_{t-1}, \ldots, O_1)\, p(q_t)$$
$$= \sum_{q_t} p(O_t \mid q_t)\, p(q_t)$$
$$= p(O_t) \qquad \Box$$

Lemma IV.2 A p-th order hidden filter Markov model $\lambda_p$ generates observations which
are a p-th order Markov process, having the property

$$p(O_t \mid O_{t-1}, \ldots, O_1) = p(O_t \mid O_{t-1}, \ldots, O_{t-p}).$$

Proof: Disprove independence by using the source output density (Equations 24 or 27).
For state $q_t = i$,

$$p(O_t \mid O_{t-1}, \ldots, O_1, q_t = i) = \mathcal{N}\Big(O_t + \sum_{j=1}^{p} a_{ij} O_{t-j},\; \sigma_i^2\Big)$$

which clearly demonstrates past observation dependence. Therefore,

$$p(O_t \mid O_{t-1}, \ldots, O_1) = \sum_{q_t} p(O_t \mid q_t, O_{t-1}, \ldots, O_1)\, p(q_t)$$
$$= \sum_{q_t} p(O_t \mid q_t, O_{t-1}, \ldots, O_{t-p})\, p(q_t)$$
$$= p(O_t \mid O_{t-1}, \ldots, O_{t-p}) \qquad \Box$$

These  two  Lemmas  provide  insight  to  the  following  theorem. 

Theorem IV.1 Let $\lambda_p$ denote a p-th order hidden filter Markov model. Let $\lambda$ denote a
standard Gaussian mixture Markov model. Then, given an observation sequence $(O_1 \ldots O_T)$,
the joint entropy of this observation sequence assuming a hidden filter source will be no
greater than the joint entropy assuming a hidden Markov model source. That is,

$$H_{\lambda_p}(O_1 \ldots O_T) \le H_{\lambda}(O_1 \ldots O_T)$$

Proof: The hidden filter model $\lambda_p$ generates a p-th order Markov process by Lemma IV.2,
with joint entropy $H_p(O_1 \ldots O_T)$. The standard hidden Markov model $\lambda$ produces
observations with joint entropy $H(O_1 \ldots O_T) = \sum_{t=1}^{T} H(O_t)$ by Lemma IV.1. Direct
application of Fielding's theorem given by Equation 50 completes this proof. □

This theorem provides justification for hidden filter Markov models in pattern
recognition problems. Similar to arguments made by Le Chevalier [68] and Libby [74], if a
classifier uses an algorithm to account for this Markov dependency within a sequence,
recognition performance will improve. By using maximum likelihood parameter estimates, we
inherently assume the working model is the same as the source model which generated the
observations. For example, when using hidden Markov modeling, it is assumed the source
is a hidden Markov model. For observation sequences which appear correlated over
particular changing blocks of the sequence, it should be assumed the source is some hidden
filter Markov model. Now that the model is justified, the next sections analyze some
important properties of the learning algorithm.

4.3 Monotonic Reestimation

One property of the Expectation Maximization (EM) algorithm guarantees that the
likelihood of the observations given the model is increased whenever the auxiliary function
is increased. For hidden Markov models with unimodal log concave output densities,
Baum and Petrie [10] demonstrated that each EM iteration steps to the global maximum
of the auxiliary function. This was shown to be true for a single Gaussian output density
and was extended to the more general elliptically symmetrical density function by Liporace
[76]. Extending an architectural concept introduced by Rabiner [96], it is demonstrated
that HMMs with mixture components can be recast into an equivalent model with only
unimodal state densities and a particular transformed probability transition matrix. Thus,
Gaussian mixture models are now guaranteed to step to the global maximum of the
auxiliary function at each iteration of the Baum-Welch algorithm. Lastly, an examination of
conditional densities, such as hidden filter models with and without mean, demonstrates
that they also maintain log concavity, so that each step again reaches the global maximum.

4.3.1 Single Mixture Gaussian HMM. First, the properties of the Baum-Welch
(or Expectation Maximization) algorithm, specifically when the output densities are
negative log concave in the parameters, will be reviewed. Recall that the scaled auxiliary
function $Q(\lambda, \bar\lambda)$ can be maximized for each of the main parameter sets
$\bar\lambda = (\bar\Pi, \bar A, \bar B)$ separately for each state. Specifically, for the new output densities,

$$Q_i(\lambda, \bar B_i) = \sum_{t=1}^{T} p(q_t = i \mid O_1 \ldots O_T, \lambda) \log \bar b_i(O_t) = \sum_{t=1}^{T} \gamma_t(i) \log \bar b_i(O_t) \qquad (51)$$

Baum examines the conditions on $b_i(O_t)$ to ensure that a critical point is also a global
maximum over all new models $\bar\lambda$. A proof using transformed observations and a log concave
property ($[\log b_i(O_t)]'' < 0$) was used by Baum and colleagues [10], showing that $Q$ has a
negative second derivative at a critical point. Liporace [76] then provided a more general
proof for elliptically symmetrical densities. So for any HMM with single Gaussian density
functions, each step of the EM algorithm is guaranteed to increase the likelihood function
monotonically by stepping to the maximum of the $Q$ function.

4.3.2 Multiple Mixture Gaussian HMM. In practice, multiple mixtures are used
to model more complex distributions of data within each state. However, since $\log b_i(O_t)$
no longer satisfies negative log concavity, Baum's Theorem [10] no longer holds. His proof
used a centered process to attain a unit normal with zero mean. Since a mixture density
does not satisfy this structure, his theorem does not apply to multiple mixtures.

4.3.2.1 Rabiner Model. Rabiner presents a similarity between Gaussian
mixtures and models with extra states [95]. However, his analysis required special
non-emitting entry and exit states for each mixture. Also, this theoretical architecture is not
easily verified with existing implementations due to these non-emitting states. Though
these special states could be analyzed as being a trivial zero "probabilistic function" of a
Markov state sequence, they would not lend themselves to theoretical convergence proofs.
Another similarity transformation needs to be defined.

4.3.2.2 Equivalence Model. These non-emitting states can be transformed
into a special structure entirely defined within the Markov state transition matrix. Let a
constructive example show this fact (Figure 15).

Figure 15. Functional equivalence of HMM $\lambda$ and Equivalence model $\lambda_E$. Top model
(multiple mixture) can be recast as bottom (single mixture) with a particular
transition structure. (-) denotes uninvolved transitions.

Each state can be expanded into substates with the transitions being products of the
original transitions and the mixture weights. The following two matrices show an original
1 state - 2 mixture HMM generator $\lambda$ and the theoretical equivalent model $\lambda_E$.
When not directly applicable to the state transformation, unaffected values have been
shown as (-).

$$A_{\lambda} = \begin{bmatrix} - & 1.00 & - \\ - & 0.96 & 0.04 \end{bmatrix}
\qquad
A_{\lambda_E} = \begin{bmatrix} - & 0.30 & 0.70 & - \\ - & 0.29 & 0.67 & 0.04 \\ - & 0.29 & 0.67 & 0.04 \end{bmatrix}$$

The original model generated 100 variable length sequences. Both models were
trained using the Baum-Welch algorithm of Chapter III, initialized to equivalent random
parameters. Table 2 provides the original output density parameters and those learned for
both architectures.


Table 2. Actual and Learned (Baum-Welch) Output Densities.

Parameter    Actual $\lambda$    Learned $\lambda$    Learned $\lambda_E$
Mean         3.53, -1.98         3.60, -2.00          3.60, -2.00
Variance     0.74, 0.22          0.68, 0.24           0.68, 0.24
Mixtures     0.30, 0.70          0.29, 0.71

The final estimates of the transition matrices are as follows, denoted by $\hat A_{\lambda}$ and
$\hat A_{\lambda_E}$. Note the similarity to the original and the theoretical equivalent. Figure 16
shows the monotonically increasing log-likelihoods for each iteration of the Baum-Welch
algorithm. Though the two models have different architectures, both converge to the
equivalent overall model having -29.45 log-likelihood.

$$\hat A_{\lambda} = \begin{bmatrix} - & 1.00 & - \\ - & 0.94 & 0.06 \\ - & 0.000 & 0.00 \end{bmatrix}
\qquad
\hat A_{\lambda_E} = \begin{bmatrix} - & 0.32 & 0.68 & - \\ - & 0.28 & 0.67 & 0.05 \\ - & 0.27 & 0.67 & 0.06 \end{bmatrix}$$

Having constructed the model and shown equivalence by example, the formal definition
of an Equivalence Model is as follows.

Figure 16. Learning the maximum likelihood models from 10 random starts based on the
architectures of the original HMM $\lambda$ (2 mixture) and theoretical equivalence
model $\lambda_E$ (2 state). Almost all models converged to the same equivalent
log-likelihood value of -29.45.


Theorem IV.2 (Equivalence Model) Given a hidden Markov model $\lambda$ such that $b_i(O)$ is
a state density function consisting of a finite convex combination of negative log concave
densities

$$b_i(O) = \sum_{k=1}^{M} c_{ik}\, b_{ik}(O) \quad \text{such that} \quad \sum_{k=1}^{M} c_{ik} = 1, \;\; c_{ik} \ge 0,$$

an Equivalence Model $\lambda_E$ exists which is functionally equivalent to $\lambda$, such that each
original state i is expanded into M substates, with the following properties:


•  Each substate of $\lambda_E$ is described by one of $b_{ik}(O)$;

•  The state transition matrix entries of $\lambda_E$, for the original state i, new substate k, are
   given by the product of the original transition into state i and the mixture weight $c_{ik}$:

$$A_{\lambda_E}(i_k) = \begin{cases} a_{j,i} \cdot c_{ik} & \text{for a transition from another state } j \ne i \\ a_{i,i} \cdot c_{ik} & \text{for a self transition} \end{cases}$$

Proof: By construction.

Furthermore, if each $b_{ik}(O)$ satisfies the properties of Liporace's Theorem 2 [76] or Baum
[10], then $Q(\lambda_E, \bar\lambda_E)$ has a unique global maximum as a function of $\bar\lambda_E$, for fixed $\lambda_E$. The
results of Theorem IV.2 ensure that each step of the Baum-Welch algorithm for mixture
densities will increase the likelihood of the model, as demonstrated in Figure 16.
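The construction behind the Equivalence Model can be sketched in a few lines. The code below is an illustration under stated assumptions, not the software used for Figure 16: the helper names `expand` and `likelihood`, the 2-state, 2-mixture parameters, and the omission of explicit entry and exit states are all choices made here. It expands each state into one substate per mixture component, scales transitions by the mixture weights, and then compares the forward-algorithm likelihood of the two models on the same observations.

```python
import math

def gauss(x, mean, var):
    """Scalar Gaussian density."""
    return math.exp(-0.5 * (x - mean) ** 2 / var) / math.sqrt(2.0 * math.pi * var)

def expand(pi, A, c):
    """Expand an N-state, M-mixture HMM into N*M single-density substates.

    Substate k of original state i receives index i*M + k."""
    N, M = len(A), len(c[0])
    pi_e = [pi[i] * c[i][k] for i in range(N) for k in range(M)]
    A_e = [[A[i][j] * c[j][l] for j in range(N) for l in range(M)]
           for i in range(N) for _ in range(M)]
    return pi_e, A_e

def likelihood(pi, A, emit, obs):
    """Forward algorithm: p(obs | model), where emit(state, x) is the state density."""
    n = len(pi)
    alpha = [pi[i] * emit(i, obs[0]) for i in range(n)]
    for x in obs[1:]:
        alpha = [sum(alpha[i] * A[i][j] for i in range(n)) * emit(j, x)
                 for j in range(n)]
    return sum(alpha)

# Hypothetical 2-state, 2-mixture model (illustrative values)
pi = [0.6, 0.4]
A = [[0.9, 0.1], [0.2, 0.8]]
c = [[0.30, 0.70], [0.50, 0.50]]
means = [[3.53, -1.98], [0.0, 1.0]]
variances = [[0.74, 0.22], [1.0, 0.5]]
M = 2
obs = [3.1, -2.2, 0.4, 1.1, -1.9]

def mix_emit(i, x):
    """Mixture density of original state i."""
    return sum(c[i][k] * gauss(x, means[i][k], variances[i][k]) for k in range(M))

def sub_emit(s, x):
    """Single Gaussian density of substate s = i*M + k."""
    return gauss(x, means[s // M][s % M], variances[s // M][s % M])

pi_e, A_e = expand(pi, A, c)
p_mixture = likelihood(pi, A, mix_emit, obs)
p_expanded = likelihood(pi_e, A_e, sub_emit, obs)
```

The two likelihoods agree to floating-point precision, and applying `expand` to a single state with self transition 0.96 and weights (0.30, 0.70) gives products that round to the 0.29 and 0.67 entries of the constructive example.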

4.4 Monotonic Reestimation of Hidden Filters

While the previous analysis was presented for standard hidden Markov models, similar
results will be extremely beneficial for hidden filter models. It was discussed that a
desirable property of the output density function was either 1) negative log concavity or
2) elliptical symmetry. This section demonstrates that hidden filters also possess
this property. The approach of Baum and later Liporace examined the negative definite
property of the second derivative of the auxiliary function with respect to the space of new
models $\bar\lambda$. The following proof takes a similar approach.

Again, the scaled auxiliary function $Q(\lambda, \bar\lambda)$ can be maximized for each of the main
parameter sets $\bar\lambda = (\bar\Pi, \bar A, \bar B)$ separately for each state (Equation 52). It will be shown
that the second derivative of the auxiliary function is negative definite over the space of
reestimated filter models $\bar\lambda$. This is most easily demonstrated using an approach similar to
Liporace [76]. This method first chooses two arbitrary models, $\lambda_1$ and $\lambda_2$, and defines a new
model $\bar\lambda$ which is a linear (convex) combination of these two. It can then be shown that the
second derivative along any convex combination of these models is negative. Since $\lambda_1$ and
$\lambda_2$ are arbitrary, the result holds over the entire space of new models. Intuitively, the
space must have only one global maximum at the single critical point of the auxiliary
function, which is where the Baum-Welch algorithm steps. Concisely,

Theorem IV.3 (Hidden Filter Auxiliary Global Maximum) Given a hidden filter Markov
model such that the state i conditional density, potentially with mean, is given by

$$\log b_i(O_t) = -\tfrac{1}{2}\log 2\pi - \tfrac{1}{2}\log \sigma_i^2 - \frac{1}{2\sigma_i^2}\Big(O_t - \mu_i + \sum_{k=1}^{p} a_{ik} O_{t-k}\Big)^2$$

then the Baum auxiliary function, pertaining to the output densities,

$$Q_b(\lambda, \bar\lambda) = \sum_{t}\sum_{i} \gamma_t(i)\Big[-\tfrac{1}{2}\log 2\pi - \tfrac{1}{2}\log \bar\sigma_i^2 - \frac{1}{2\bar\sigma_i^2}\Big(O_t - \bar\mu_i + \sum_{k=1}^{p} \bar a_{ik} O_{t-k}\Big)^2\Big] \qquad (52)$$

has a single global maximum for fixed $\lambda$.

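Proof: For the unidimensional, single order case, p = 1, let the predictor coefficient
be denoted by b. The Baum auxiliary function defined in Equation 18 can be maximized
separately for each state (Equation 52). Drop the state notation and define the reestimated
model as

$$\bar\lambda = \theta\lambda_1 + (1 - \theta)\lambda_2 \qquad (53)$$

for $0 \le \theta \le 1$, where the new model $\bar\lambda$ is a linear combination of two arbitrary ones. Now
examine the second partial derivative of the auxiliary function with respect to $\theta$, evaluated
at a critical point. With the variance parameterized through $c = 1/\sigma^2 > 0$, Equation 53
implies

$$\bar\mu = \theta\mu_1 + (1 - \theta)\mu_2$$
$$\bar c = \theta c_1 + (1 - \theta) c_2$$
$$\bar b = \theta b_1 + (1 - \theta) b_2$$

so that

$$\frac{\partial^2 Q_b(\lambda, \bar\lambda)}{\partial\theta^2} = \sum_{t=1}^{T} \gamma_t \Big[ -\frac{(c_1 - c_2)^2}{2\bar c^2} - \bar c\big((b_1 - b_2)O_{t-1} - (\mu_1 - \mu_2)\big)^2$$
$$\qquad - 2\big((b_1 - b_2)O_{t-1} - (\mu_1 - \mu_2)\big)(O_t - \bar\mu + \bar b O_{t-1})(c_1 - c_2) \Big]$$

$$= \sum_{t=1}^{T} \gamma_t \Big[ -\frac{(c_1 - c_2)^2}{2\bar c^2} - \bar c\big((b_1 - b_2)O_{t-1} - (\mu_1 - \mu_2)\big)^2$$
$$\qquad - 2(b_1 - b_2)O_{t-1}(O_t - \bar\mu + \bar b O_{t-1})(c_1 - c_2)$$
$$\qquad + 2(\mu_1 - \mu_2)(O_t - \bar\mu + \bar b O_{t-1})(c_1 - c_2) \Big] \qquad (54)$$

Now, seek the second partial only at a critical point, implying both $\partial Q/\partial\bar\mu = 0$ and
$\partial Q/\partial\bar b = 0$. Expanding,

$$\frac{\partial Q_b(\lambda, \bar\lambda)}{\partial \bar b} = \sum_{t=1}^{T} \gamma_t \big[ -\bar c\, O_{t-1}(O_t - \bar\mu + \bar b O_{t-1}) \big] = 0$$

implying

$$0 = \sum_{t=1}^{T} \gamma_t \big[ -2(c_1 - c_2)(b_1 - b_2) O_{t-1}(O_t - \bar\mu + \bar b O_{t-1}) \big] \qquad (55)$$

and

$$\frac{\partial Q_b(\lambda, \bar\lambda)}{\partial \bar\mu} = \sum_{t=1}^{T} \gamma_t \big[ \bar c\,(O_t - \bar\mu + \bar b O_{t-1}) \big] = 0$$

implying

$$0 = \sum_{t=1}^{T} \gamma_t \big[ 2(\mu_1 - \mu_2)(c_1 - c_2)(O_t - \bar\mu + \bar b O_{t-1}) \big] \qquad (56)$$

These last two expressions (Equations 55 and 56) cancel the last two terms in Equation 54,
leaving the negative of a sum of nonnegative squared terms. Since the sum is negative for
all choices of $\lambda_1$ and $\lambda_2$, the auxiliary function is negative definite at the critical point.
Also, if there were two critical points, the second derivative would have to become positive
for some pair $\lambda_1$ and $\lambda_2$. Since this cannot occur, only one critical point exists and it is
the global maximum. □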
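The single-maximum property can be spot-checked numerically for the p = 1 case. The following sketch is illustrative only: the observation values, the state weights `gamma`, and the perturbation grid are arbitrary assumptions. It solves the first-order conditions in closed form (weighted least squares for the mean and predictor, then the weighted residual variance) and confirms that no perturbed parameter set yields a larger auxiliary value.

```python
import math

def q_value(gamma, obs, mu, b, var):
    """Output-density part of the auxiliary function for one state, p = 1."""
    total = 0.0
    for t in range(1, len(obs)):
        e = obs[t] - mu + b * obs[t - 1]          # prediction residual
        total += gamma[t] * (-0.5 * math.log(2.0 * math.pi)
                             - 0.5 * math.log(var) - e * e / (2.0 * var))
    return total

def critical_point(gamma, obs):
    """Solve dQ/dmu = 0 and dQ/db = 0 (2x2 linear system), then dQ/dvar = 0."""
    s = sx = sxx = so = sox = 0.0
    for t in range(1, len(obs)):
        g, x, o = gamma[t], obs[t - 1], obs[t]
        s += g
        sx += g * x
        sxx += g * x * x
        so += g * o
        sox += g * o * x
    det = sx * sx - s * sxx
    mu = (sx * sox - so * sxx) / det
    b = (s * sox - sx * so) / det
    var = sum(gamma[t] * (obs[t] - mu + b * obs[t - 1]) ** 2
              for t in range(1, len(obs))) / s
    return mu, b, var

# Hypothetical observations and state posteriors
obs = [0.5, -0.2, 0.9, 0.1, -0.7, 0.4, 1.2, -0.3]
gamma = [0.0, 0.9, 0.2, 0.7, 0.5, 0.8, 0.3, 0.6]

mu, b, var = critical_point(gamma, obs)
q_star = q_value(gamma, obs, mu, b, var)
perturbed = [q_value(gamma, obs, mu + dm, b + db, var * f)
             for dm in (-0.2, 0.0, 0.2)
             for db in (-0.2, 0.0, 0.2)
             for f in (0.8, 1.0, 1.25)]
```

Every entry of `perturbed` is bounded above by `q_star`, with equality only for the unperturbed combination, which is consistent with a single critical point that is also the global maximum.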

Naturally, this result for a single conditional density applies to mixtures of conditional
densities, since the previous section constructively demonstrated that a simpler equivalence
model exists. Combining the results of this section and the last, multiple
mixtures of hidden filters can be transformed into an equivalent model with a single filter
per state, and each Baum-Welch iteration will step to the global maximum of the auxiliary
function. Multiple iterations of Baum-Welch will monotonically increase the likelihood of
the reestimated parameters.

4.5 Conclusion

Based on the assumed source model, it was first proven that a hidden filter Markov
model sequence has less joint entropy than a sequence generated from a standard hidden
Markov model. Pattern recognizers based on these correct models should exhibit lower
classification errors. Gaussian mixture hidden Markov models have been demonstrated to
be equivalent to larger single mixture models with an increased number of states. This
allows currently known theorems relating to the convergence properties of the algorithm to
be satisfied. Likewise, for conditional state density functions, the Baum auxiliary function
is guaranteed to have a global maximum at the single critical point, which is achieved at
each iteration of the Baum-Welch algorithm. In summary, the direct application of the
reestimation equations in Chapter III guarantees a better model at each iteration, which
further implies that the models converge monotonically in likelihood. The next chapter uses
the reestimation outlined in Chapter III, with the insight of the algorithmic properties
outlined in this chapter, for the application of modeling speaker dependent phonemes for
speaker recognition.

V. Speaker Recognition


5.1 Introduction

This chapter describes the extensive experimentation and evaluation of the hidden
filter Markov modeling approach. The next section reminds the reader why this
application requires better techniques. A system-level description is then provided, shown
in Figure 17. The YOHO database is described and used for all experiments, with initial
experiments applied to speaker identification. Where appropriate, all methods compare
results to vector quantization, a well-proven technique for text-independent speaker
modeling. Speaker verification, an extremely difficult problem, compares log-likelihood
ratios to a posteriori globally determined thresholds. Three methods of normalization, using
close cohort speakers as a reference, are examined, with a second order approach being
developed in this research. Lastly, an important general pattern recognition concern is
analyzed, which answers the question, "Does my system meet requirements?" It will be
shown that a particular configuration of our system meets the stringent U.S. Government
requirement of 1% false reject and 0.1% false acceptance rates.

5.2  Why  Better  Speaker  Recognition? 

The  National  Institute  of  Standards  and  Technology  (NIST)  recently  provided  a 
set  of  guidelines  [86]  to  Federal  agencies  and  departments  for  verifying  the  identities  of 
computer  system  users.  They  describe  biometric-based  authentication  as  the  measurement 
of  a  unique  biological  feature  used  to  verify  the  claimed  identity  of  an  individual  through 
automated  means.  Biometric  authentication  mechanisms  will  attempt  to  measure  a  unique 
biological  feature  to  the  degree  that  only  one  person  may  be  authenticated  as  a  specific 
user.  The  biological  feature  may  be  based  on  a  physiological  or  behavioral  characteristic 
as  remarked  in  Chapter  I.  The  physiological  characteristics  measure  vocal  tract  and  other 
speech  production  physiology  while  the  behavioral  characteristics  measure  all  other  voice 
habits  and  patterns.  This  chapter  examines  hidden  filter  Markov  modeling  of  phonemes 
for  this  identification  and  verification  process. 

Campbell writes [18]:

    The LA Times recently reported that $1.2 billion is lost annually from telephone
    calling card fraud and the accounting firm of Ernst and Young estimates that
    high-tech computer thieves in the U.S. steal $3 to $5 billion annually.

The use of automatic speaker recognition could reduce these thefts substantially. In
addition to these problems, legislation is being considered to automate, nationwide, the
electronic distribution of welfare benefits using voice verification [36] among other techniques.

As introduced in Chapter I, speaker recognition includes speaker identification and
speaker verification. When performing verification or authentication, the errors can be
categorized by two measures, the False Acceptance Rate (FAR) and the False Rejection
Rate (FRR). The FAR (Type 2 errors) represents the percentage of unauthorized users who
are incorrectly accepted as valid users. The FRR (Type 1 errors) represents the percentage
of authorized users who are incorrectly rejected.
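The two error measures can be computed directly from verification scores and a decision threshold. The sketch below is a minimal illustration: the function name `error_rates`, the score values, and the convention that a claim is accepted when its score meets or exceeds the threshold are assumptions made here.

```python
def error_rates(genuine_scores, impostor_scores, threshold):
    """Return (FAR, FRR) in percent; a claim is accepted when score >= threshold."""
    false_accepts = sum(1 for s in impostor_scores if s >= threshold)
    false_rejects = sum(1 for s in genuine_scores if s < threshold)
    far = 100.0 * false_accepts / len(impostor_scores)
    frr = 100.0 * false_rejects / len(genuine_scores)
    return far, frr

# Hypothetical scores for true-speaker and impostor trials
genuine = [2.1, 1.7, 0.4, 3.0, 2.5]
impostor = [-1.2, 0.6, -0.5, -2.0, 0.1, -0.9, -1.5, 0.3, -0.1, -3.0]
far, frr = error_rates(genuine, impostor, threshold=0.5)
```

Raising the threshold lowers the FAR at the cost of a higher FRR, so the two rates must be reported together for a given operating point.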

All experiments were performed on the Linguistic Data Consortium's (LDC) YOHO
database, with initial identification results providing insight to the more extensive
verification experiments. Following these experiments, a hypothesis analysis will provide the
maximum critical errors allowed while still meeting the goal levels specified of 1% FR and
0.1% FA.

5.3  YOHO  Database 

The YOHO Speaker Verification database is the only large scale^, scientifically
controlled and collected, high-quality speech database for speaker authentication testing at
high confidence levels. This corpus has been designed to test speaker verification at U.S.
Government required error rates of 1% false rejection and 0.1% false acceptance [17, 67],
with a goal level one order of magnitude better (0.1% false rejection and 0.01% false
acceptance). The 138 subjects, 106 males and 32 females, were asked to participate in 14
sessions over a 3-month interval. These included 4 enrollment sessions of 24 utterances each
and 10 verification sessions of four utterances each.

^When uncompressed, the raw speech consists of 1.2 gigabytes of data [67].

Figure 17. Speaker recognition system overview.


The speech material consists of "combination-lock" phrases. An example prompt is:
"57 - 26 - 64", pronounced "fifty-seven, twenty-six, sixty-four". Each phrase consists of
three number doublets. The doublets are chosen from a list which includes all the doublets
from 21 to 99 with the following exceptions: (1) no exact decades (30, 40, etc.), (2) no
double digits (22, 33, etc.), and (3) no numbers ending in "8" (28, 38, etc.). Pausing
between the doublets is optional, but not encouraged [67]. The vocabulary totals sixteen
words, producing 56 possible doublets and a list of 166,320 phrases.
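The phrase count follows from simple counting. The short check below assumes that a phrase is an ordered selection of three distinct doublets; that reading is inferred here from the arithmetic and is not stated explicitly in the corpus description. It also relates the 96 enrollment phrases per speaker (4 sessions of 24 utterances) to the full phrase list.

```python
doublets = 56

# Ordered triples of distinct doublets (assumed phrase structure)
phrases = doublets * (doublets - 1) * (doublets - 2)

enrollment_phrases = 4 * 24      # 4 enrollment sessions of 24 utterances each
coverage_percent = 100.0 * enrollment_phrases / phrases
```

Under this assumption the product reproduces the 166,320 phrases quoted above, and a speaker's enrollment set covers a little under 0.06% of them.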
5.4 Phonetic Labeling and Training

Using the full TIMIT database [56], single mixture - 3 state models were previously
trained by Anderson [3] based on 12 MFCC, 12 $\Delta$MFCC and 12 $\Delta\Delta$MFCC, including
log energy, $\Delta$ log energy and $\Delta\Delta$ log energy. The full set of Kai-Fu Lee's 49 phoneme
models [69] allowed segmentation and labeling of the (as yet unlabeled, but transcribed)
YOHO database.

5.4.1 Forced Viterbi Alignment. Viterbi decoding [48], for a single hidden Markov
model, provides the most probable state sequence given an observation sequence. The
algorithm also provides the overall likelihood of the sequence. Since the transcriptions are
provided for each enrollment utterance, a network of phoneme models which must be
traversed from beginning to end is known. Consider building a very large, single hidden
Markov model from the individual phoneme models. The Viterbi algorithm can uncover
the most likely state sequence, which in turn provides a phoneme label for each analysis
frame. Thus, the forced Viterbi procedure constrains the decoding of an input observation
sequence to an ordered list of word and phoneme transcriptions. While there is not yet a
substitute for hand segmentation by a phonetician, the overall process is fast, efficient and
remarkably reliable with a good set of trained models.
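The forced alignment idea can be sketched for a strictly left-to-right chain of states. The code below is an illustration under simplifying assumptions, not the alignment software used here: one state per phoneme, no skips, a path that must start in the first state and end in the last, and hand-made emission scores. The function name `forced_align` and the labels are hypothetical.

```python
import math

NEG_INF = float("-inf")

def forced_align(log_emit, log_self, log_next, labels):
    """Viterbi decoding through a left-to-right chain of states.

    log_emit[t][s]: log density of frame t in state s
    log_self[s], log_next[s]: log probability of staying in s or moving to s + 1
    labels[s]: phoneme label attached to state s
    Returns (best log likelihood, per-frame labels).
    """
    T, S = len(log_emit), len(labels)
    delta = [[NEG_INF] * S for _ in range(T)]
    back = [[0] * S for _ in range(T)]
    delta[0][0] = log_emit[0][0]                  # path must start in state 0
    for t in range(1, T):
        for s in range(S):
            stay = delta[t - 1][s] + log_self[s]
            move = delta[t - 1][s - 1] + log_next[s - 1] if s > 0 else NEG_INF
            if move > stay:
                delta[t][s], back[t][s] = move + log_emit[t][s], s - 1
            else:
                delta[t][s], back[t][s] = stay + log_emit[t][s], s
    path = [S - 1]                                # path must end in the last state
    for t in range(T - 1, 0, -1):
        path.append(back[t][path[-1]])
    path.reverse()
    return delta[T - 1][S - 1], [labels[s] for s in path]

# Hypothetical 3-phoneme transcription and 6 frames of emission scores
labels = ["F", "AY", "V"]
hi, lo = math.log(0.8), math.log(0.1)
favored = [0, 0, 1, 1, 1, 2]                      # state each frame resembles most
log_emit = [[hi if s == f else lo for s in range(3)] for f in favored]
log_self = [math.log(0.6)] * 3
log_next = [math.log(0.4)] * 3
score, frame_labels = forced_align(log_emit, log_self, log_next, labels)
```

The returned `frame_labels` assign one phoneme label to every analysis frame, and `score` is the log likelihood of the best constrained path.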

Table 3 provides the initial TIMIT phoneme list for monophone models constrained
to the YOHO vocabulary. See also Appendix 5.A for example words using these phonemes
and the actual language grammar. The YOHO vocabulary has 19 monophones, with an
additional /sil/ (leading and trailing silence) and /sp/ (interword space). The /DX/ is not
used in the TIMIT grammar.

Table 3. YOHO phoneme model list, with silence (sil) and interword space (sp).

(DX)   ER   F   IY   ...   sil   AO   AY   EH   EY   IH   K   ...   sp

The relative proportions for each phoneme, over the entire YOHO database, are
provided in Figure 18. As evident from the graph, the enrollment data follows the same
distribution as the test data^. These results indicate there is adequate coverage of the
phoneme space within the enrollment data [19].


Figure 18. Histogram of all YOHO enrollment and verification utterances, after a forced
Viterbi segmentation bootstrapped from TIMIT.


5.4.2 Embedded Reestimation. Once the entire YOHO database was phonetically
marked, all four sessions of enrollment data were used to train speaker dependent
models. The phoneme models were reestimated individually using the Baum-Welch
algorithm, with the initial model being the speaker independent TIMIT trained models, when
possible. When not possible due to the architecture involved, an initialization procedure
was applied to each monophone separately, as follows: 1) uniform segmentation into states;
2) segmental k-means based on Viterbi's most likely state sequence. Then, an embedded
reestimation of all speaker models was accomplished by concatenating the individual
monophones for the utterances and updating all models simultaneously using the Baum-Welch
algorithm [134].

^Enrollment data accounts for only 0.057% of the total phrases possible.

5.5 Speaker Identification Results on YOHO

Speaker identification uses a Bayesian classifier, assuming equal priors, choosing
speaker model i from the normalized Viterbi log likelihoods for an utterance, or set of
utterances, U:

$$i = \arg\max_{k} \{ \log p(U \mid \lambda_k) \} \qquad (57)$$

These results provide a reference for speaker separation, model and feature
choice trade-offs. Speaker identification error provides an upper bound on verification error
[34], based on entropy.

First, an examination of vector quantization (VQ) provides a baseline on the YOHO males
and females separately. The VQ procedure assumes independent observations and clusters
speech without any temporal assumptions. Next, the frame and vector autoregressive
techniques are applied to the YOHO database. These identification results select the
alternative techniques which will be further investigated for verification.
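The decision rule of Equation 57 amounts to an argmax over per-speaker scores. The sketch below is illustrative only: the speaker identifiers and log likelihood values are invented, and dividing by the number of frames is one plausible reading of "normalized" that is assumed here rather than taken from the text.

```python
def identify(log_likelihoods, num_frames):
    """Pick the speaker whose frame-normalized log likelihood is largest."""
    normalized = {spk: ll / num_frames for spk, ll in log_likelihoods.items()}
    best = max(normalized, key=normalized.get)
    return best, normalized

# Hypothetical Viterbi log likelihoods of one utterance under three speaker models
scores = {"spk_101": -5230.4, "spk_102": -5011.9, "spk_103": -5377.0}
best, normalized = identify(scores, num_frames=250)
```

For a single utterance the normalization does not change the winner; it matters when scores from utterances of different lengths are compared or combined.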

5.5.1  Vector  Quantization.  The  classic  approach  to  modeling  speakers  creates 
a  representation  of  their  spectral  vectors  [4,  6,  47,  119],  in  this  case  the  Mel  frequency 
cepstral  vectors.  Codebooks  were  derived  using  the  Linde-Buzo-Gray  (LBG)  clustering 
algorithm  [75]  over  all  enrollment  sessions  until  convergence.  Test  results  are  derived  from 
the Euclidean minimum distortion over all test utterance frames. Table 4 shows the closed-set
speaker identification error rates for both 32 and 64 codeword models when testing with 1, 2
and 4 combination phrases. These results serve as the baseline performance for a non-temporal
model applied to YOHO.

Table 4.  Closed-set speaker identification error rates (%) for 1, 2 and 4 combination lock
phrases applied to both 32 and 64 VQ codeword models. Features consisted of
the 12-dimensional Mel frequency cepstral coefficients (MFCC) only.

                          Males (Females)
Method                1              2              4
VQ - 32 codewords     6.86 (6.17)    2.50 (2.50)    1.41 (0.94)
VQ - 64 codewords     4.27 (2.97)    1.65 (1.25)    1.04 (0.63)
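A minimal sketch of minimum-distortion scoring against speaker codebooks follows. It assumes squared Euclidean distance averaged over frames and codebooks that are already trained (for example by LBG); the function names and the toy data are illustrative only and are not taken from the experiments.

```python
def vq_distortion(frames, codebook):
    """Average minimum squared Euclidean distortion of frames to a codebook."""
    if not frames or not codebook:
        raise ValueError("frames and codebook must be non-empty")
    total = 0.0
    for frame in frames:
        # Distance from this frame to its nearest codeword.
        total += min(
            sum((x - c) ** 2 for x, c in zip(frame, codeword))
            for codeword in codebook
        )
    return total / len(frames)


def identify_by_vq(frames, codebooks):
    """Pick the speaker whose codebook gives the lowest average distortion."""
    return min(codebooks, key=lambda spk: vq_distortion(frames, codebooks[spk]))


# Toy 2-dimensional example with two hypothetical speakers.
toy_codebooks = {"a": [[0.0, 0.0], [1.0, 1.0]], "b": [[5.0, 5.0], [6.0, 6.0]]}
toy_frames = [[0.1, 0.0], [0.9, 1.2]]
winner = identify_by_vq(toy_frames, toy_codebooks)
```

No temporal ordering enters the score: shuffling the frames leaves the distortion unchanged, which is why VQ serves as the non-temporal baseline.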
5.5.2  Phonemic Frame AR Hidden Filters.  Poritz [92] proposed the fundamental
hidden filter model for speaker identification. His architecture consisted of a 5-state ergodic
model using third order filters. This section extends the training to all individual phoneme
models and reestimates all simultaneously using the embedded Baum-Welch algorithm.
The following approach (see Figure 19) uses labeled phoneme enrollment data to initialize
the hidden filters and builds networks for embedded Baum-Welch reestimation based on
transcriptions and word dictionaries. The result is a set of left-to-right speaker-dependent
models, corresponding to specific phones. A forced Viterbi alignment using the utterance
transcription and word dictionaries provides the overall log likelihood score.

Unlike  the  VQ  case  which  requires  some  spectral  representation  processing,  the  frame 
AR  hidden  Markov  model  only  requires  the  autocorrelation  of  the  raw  samples.  The  frames 
must  be  gain  normalized,  since  raw  autocorrelation  features  vary  greatly  with  signal  energy. 
It  was  also  demonstrated  in  Section  3.4  that  by  using  the  Poritz  method  on  frames,  only 
the  autocorrelation  coefficients  are  required  in  the  reestimation.  The  resulting  p-th  order 
phoneme  hidden  filters  are  those  which  minimize  the  Itakura-Saito  distortion  to  all  frames 
assigned  to  a  hidden  state.  Table  5  shows  closed  set  speaker  identification  error  rates  for 
various p-th order filters and architectures. While these results are competitive with vector
quantization, we further examine models using the vector Mel frequency cepstral process.


Table 5.  Closed-set Speaker Error Rates (%) using Poritz Phoneme Models on 1, 2 and
4 combination phrases. Monophones consist of either 1 or 3-state left-to-right
models with filter order p.

                        Females
Method              1        2        4
1-state, p=8       11.09     5.00     2.19
1-state, p=10       8.91     2.34     0.94
1-state, p=12       8.20     2.81     0.63
3-state, p=12       4.84     1.72     0.94

5.5.3  Vector Hidden Filters.  The vector autoregressive hidden filters can be
easily related to many existing statistical speaker recognition approaches. Denoting the
number of states as N, the number of mixtures per state as M, and the predictor matrices as
$B_i = [A_{i1}, \ldots, A_{ip}]$, the models in Table 6 are attainable. Note that the
hidden Markov model is attained when the hidden filter predictor matrices are all set
to zero. A hidden Markov model based on Δ coefficients (Section 2.4.3) is related by a
particular choice of matrices being set to I and -I respectively.

[Figure 19 diagram: speaker-dependent, 3-state left-to-right, p-th order frame
autoregressive hidden filter Markov models]

Figure 19.  The phoneme frame autoregressive hidden filter approach models individual
phonemes as 3-state left-to-right hidden filters. The A denotes a p-order
hidden filter, as reestimated in Section 3.4.
Table 6.  Relationship of Vector AR Hidden Filters to Other Models. Note: W denotes
the size of the Δ window.

Model                                   Constraints
Vector Quantization                     $B_i = 0$, $\Sigma_i =$ [illegible] $= 1$,
                                        all transitions $a_{ij}$ equiprobable
Gaussian                                $B_i = 0$, N = 1, M = 1
Gaussian Mixture Model                  $B_i = 0$, N = 1
Hidden Markov Model (baseline)          $B_i = 0$, $\forall i$
Hidden Markov Model (Δ coeffs)          $A_1 = I$, $A_{2W} = -I$, $A_i = 0$ for all other i
Vector AR Hidden Filter Markov model    Unconstrained reestimation

However, the correct choice of model filter order remains a difficult procedure for
any linear system [63, 94]. Several single state models have been examined, with
increasing filter order. Appendix 5.B examines penalty function methods for correct model
order selection. For hidden filter Markov models, this analysis is unique. Increasing
the filter order decreased the residual variance, or prediction error, but all vector hidden
filter models lacked the ability to distinguish between speakers or phonemes. Others have
paradoxically noted better likelihood scores, yet decreased recognition. We propose an
explanation for this phenomenon, detailed in Appendix 5.C, based on the strict stationarity
of the original speech samples. All further results will be based on a zero-th order vector
hidden filter Markov model, i.e. an HMM. We continue to extract and model context
information by examining both first and second order regressive coefficients within this
zero-th order architecture.

Table 7 shows error rates for various numbers of combination lock phrases. For each
gender, two different Viterbi constraints were examined: forced Viterbi alignment and a
Word-Pair grammar.* The latter can be used to check whether the prompted text matches the
most likely Viterbi label hypothesis. The Word-Pair grammar also catches many confused
doublets that a simple word dictionary grammar would miss. For example, for the prompt
“75-29-47”, Viterbi with a word grammar only may hypothesize a transcription of
“seventy-five-one-nine-forty-seven”, a label that is not valid under a word-pair grammar,
nor is it a valid YOHO transcription. Given the superiority of the forced Viterbi alignment
procedure, all remaining results use forced Viterbi alignment based on the prompted
transcription. An analysis concerning the entropy of the language, induced by the choice of
grammar, indicates that forced Viterbi is most suitable for the speaker recognition problem.
See Appendix 5.D for full details.

* In addition to forced Viterbi and Word-Pair, one could easily perform Word-Only grammar
or No-grammar phoneme decoding. It will be shown that allowing impostors greater decoding
flexibility decreases the separation between true user and impostor log-likelihood scores.


Table 7.  Closed-Set Speaker Error Rates (%) with Viterbi Constraints for 1, 2 and 4
combination phrases.

                        Males (Females)
Method              1              2              4
Forced Viterbi      1.70 (1.72)    0.47 (0.78)    0.19 (0.31)
Word Pair           1.75 (2.19)    0.47 (0.63)    0.38 (0.31)

A practical pattern recognition concern is the amount of training data for model
reestimation. With first and second order regression coefficients, each speaker is represented
by 21 three-state monophone models, resulting in 4914 output density parameters per speaker.
Based on an average of 38,000 enrollment observations, the ratio of training patterns to
model parameters is 7.7. To increase this ratio, feature reduction and covariance sharing
were performed. Table 8 shows that reducing the model size by removing transitional
features increases error rates, while sharing covariance matrices among individual monophone
states has the opposite effect.
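The parameter count and ratio quoted above can be reproduced with a short calculation. The sketch assumes a 39-dimensional feature vector (base, first and second order regression coefficients) and one Gaussian per state with a mean vector and a diagonal covariance; that breakdown is an assumption chosen because it reproduces the stated total of 4914.

```python
MONOPHONES_PER_SPEAKER = 21
STATES_PER_MONOPHONE = 3
FEATURE_DIM = 39  # assumed: 13 base features plus delta and delta-delta

# Assumed output density per state: one mean vector and one diagonal
# covariance (variance) vector, each of length FEATURE_DIM.
params_per_state = 2 * FEATURE_DIM
total_params = MONOPHONES_PER_SPEAKER * STATES_PER_MONOPHONE * params_per_state

ENROLLMENT_OBSERVATIONS = 38_000  # average per speaker, as stated in the text
ratio = ENROLLMENT_OBSERVATIONS / total_params

print(total_params, round(ratio, 1))  # prints: 4914 7.7
```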

The best identification results are shown in Table 9, where the architecture includes
two mixtures per state, a single shared diagonal covariance for each monophone and 21
monophones per speaker. Features include transitional information by incorporating Δ and
ΔΔ Mel frequency and energy coefficients. Experimentally, all male tests were correctly
classified when prompting four combinations as a test trial. The inability to correctly
identify all females can be explained by Campbell [17], where speaker #240 used a “false
voice.” See Appendix 5.E for the typical log-likelihood of these four test utterances. If this
session were removed, all females would be correctly classified as well.

Table 8.  Closed-Set Speaker Error Rates (%) with/without Shared Covariance using
forced Viterbi decoding for 1, 2 and 4 combination phrases. Base feature is
MFCC + Energy.

                  Males (Females), Σ/state
Feature           1              2              4
Base+Δ+ΔΔ         1.70 (1.72)    0.47 (0.78)    0.19 (0.31)
Base+Δ            2.52 (2.34)    0.99 (0.94)    0.57 (0.31)
Base              5.83 (5.55)    2.55 (2.34)    1.60 (0.94)

                  Males (Females), Σ/monophone
Feature           1              2              4
Base+Δ+ΔΔ         1.06 (1.48)    0.47 (0.78)    0.19 (0.31)
Base+Δ            1.37 (1.25)    0.57 (0.47)    0.28 (0.31)
Base              3.56 (2.19)    1.46 (1.41)    0.85 (0.31)

Table 9.  Closed-Set Speaker Error Rates (%) Using Decreasing Transitional Features and
2 Mixtures for 1, 2 and 4 combination phrases. Base feature is MFCC + Energy.

                  Males (Females), Σ/monophone
Feature           1              2              4
Base+Δ+ΔΔ         0.92 (1.48)    0.28 (0.78)    0.00 (0.31)
Base+Δ            1.16 (1.02)    0.52 (0.47)    0.19 (0.31)
Base              3.09 (2.03)    1.04 (0.78)    0.66 (0.31)

5.5.4  False Voice Effects.  It has been noted by Campbell [17] that Speaker #240,
in test session #969, used a “false” voice for all four utterances. This affected identification
(and verification) results continually through misclassification of these particular trials.
Shown in Figure 20 is the drop by several orders of magnitude in the normalized log-likelihood
scores for these utterances using speaker #240’s model. Note that in the identification (and
verification) results for females, the 0.31% result is a consequence of these utterances.

Figure 20.  False voice effects of speaker 240, session 969, shown in the forced Viterbi
normalized log-likelihood scores (*).

5.6  Verification Methodology

The procedure of speaker verification involves some method of comparing the test
utterance to the “relative proximity” of the claimed speaker model, instead of simply choosing
the maximum score as in speaker identification (Equation 57). Often some threshold needs to
be specified, either globally for all speakers or individually for each speaker. Recently,
a proposal to use cohort speakers provides a method to use likelihood ratios as a basis for
verification, where a global a posteriori threshold will be examined for equal error rate
analysis. Examine Figure 21 to understand the reason for a relative threshold. For several
test utterances of a male speaker, the forced Viterbi log likelihoods are plotted using the
true speaker’s model and several impostor models. The log likelihoods vary greatly across test
utterances, yet the models appear to track one another. Obviously, poor results would
be observed using a single, fixed threshold.

Figure 21.  Typical log-likelihood of true model and impostor models shows the variability
based on the transcription (prompted words), which forces some non-fixed
thresholding scheme.

5.6.1  Likelihood Ratios.  The likelihood ratio test is a useful tool based on
Bayesian analysis for performing speaker verification. The Bayes error rate, a statistical
upper bound on performance of any pattern classifier [35, 110], is achieved by applying the
Bayes decision rule. This maximum a posteriori (MAP) approach, given utterance U, will

choose $\lambda_1$ if $p(\lambda_1 \mid U) > p(\lambda_2 \mid U)$,
choose $\lambda_2$ otherwise,

or, by using Bayes rule, the probability density functions, either known or approximated,
can be used. Taking the logarithm,

$$\log \frac{p(U \mid \lambda_1)}{p(U \mid \lambda_2)} > T, \qquad \text{where } T = \log \frac{p(\lambda_2)}{p(\lambda_1)}.$$

Speaker verification systems are then based on this log-likelihood ratio $\mathcal{L}$ of the
utterance (or set of utterances) by applying the concept to a claimed model ($\lambda_1$)
against the alternative that the speaker is not the claimant ($\lambda_2$):

$$\mathcal{L}(U) = \log \frac{p(U \mid \lambda = \lambda_{claim})}{p(U \mid \lambda \neq \lambda_{claim})}
= \log p(U \mid \lambda = \lambda_{claim}) - \log p(U \mid \lambda \neq \lambda_{claim}). \qquad (58)$$


If the above quantity is greater than the threshold T, which accounts for the unknown
speaker prior probabilities, the maximum likelihood decision is to accept the utterance U
as the claimed speaker. We seek to approximate this last quantity using a set of “close”
reference speakers, as suggested by Higgins [51]. Campbell establishes methods for testing
on YOHO, calling these reference speakers “cohorts”. To determine “close”, we examine
training utterances through the set of reference models. This procedure will be referred to
as cohort normalization of the log-likelihood ratio.

Furui [113] discusses several measures for cohort normalization, each a potential
approximation to the last expression of the log-likelihood ratio (Equation 58). Some of these
approximations include the logarithm of the summation of cohort likelihoods or the
summation (average) of log likelihoods [80]. This latter geometric mean cohort normalization
method was used for these experiments. Specifically, define a set of cohort speakers C of
size |C|. Then, using the joint likelihood of the set of cohort speakers, the log-likelihood
ratio is given by

$$\mathcal{L}(U) = \log p(U \mid \lambda_{claim}) - \log p(U \mid \lambda \neq \lambda_{claim})
= \log p(U \mid \lambda_{claim}) - \log \frac{1}{|C|} \sum_{j \in C} p(U \mid \lambda_j). \qquad (59)$$

In practice, it has been reported this can be further approximated by

$$\mathcal{L}(U) \approx \log p(U \mid \lambda_{claim}) - \frac{1}{|C|} \sum_{j \in C} \log p(U \mid \lambda_j). \qquad (60)$$

If the last expression is assumed to be dominated by the single closest reference speaker,
then the maximum operator can also be used for normalization.
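The three normalizations just described can be sketched as follows: the log of the averaged cohort likelihoods (Equation 59), the average of the cohort log likelihoods (the geometric mean form of Equation 60), and the maximum over the cohort. The function names and the `method` keyword are hypothetical, and the inputs are assumed to be log likelihoods already produced by forced Viterbi decoding.

```python
import math


def _log_mean_exp(values):
    """Numerically stable log of the arithmetic mean of exp(values)."""
    peak = max(values)
    return peak + math.log(sum(math.exp(v - peak) for v in values) / len(values))


def cohort_score(claim_loglik, cohort_logliks, method="geometric"):
    """Cohort-normalized log-likelihood ratio for one test utterance.

    claim_loglik   -- log p(U | lambda_claim)
    cohort_logliks -- list of log p(U | lambda_j) for each cohort model j
    """
    if not cohort_logliks:
        raise ValueError("at least one cohort score is required")
    if method == "arithmetic":      # log of the mean likelihood (Equation 59)
        norm = _log_mean_exp(cohort_logliks)
    elif method == "geometric":     # mean of the log likelihoods (Equation 60)
        norm = sum(cohort_logliks) / len(cohort_logliks)
    elif method == "max":           # single closest cohort speaker
        norm = max(cohort_logliks)
    else:
        raise ValueError(f"unknown method: {method}")
    return claim_loglik - norm
```

By Jensen's inequality the geometric form never gives a smaller score than the arithmetic form, and the maximum form never gives a larger one, so the three variants differ mainly in where the global threshold must sit.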

5.6.2  Measure of HMM Similarity.  Each speaker is represented by 21 speaker
dependent phoneme models, including silence and interword space. All 96 enrollment
utterances are used to establish first and second order statistics for the Viterbi log
likelihoods. Creating cohort sets is accomplished in one of three ways, each producing a
sorted list of “close” speakers. Define the Difference of Means log ratio as

$$d_{DOM}(\lambda_i, \lambda_j) = \log \frac{p(U \mid \lambda_i)}{p(U \mid \lambda_j)}
= \log p(U \mid \lambda_i) - \log p(U \mid \lambda_j),$$
where U are all enrollment utterances for speaker i. Reynolds [104] provides a Symmetric
distortion measure between two models using enrollment utterances from both the target
and the potential cohort to determine similarity. If $U_i$ and $\lambda_i$ represent speaker i
training observations and model respectively, then a symmetric distortion measure can be
defined as

$$d_{SYM}(\lambda_i, \lambda_j) = \log \frac{p(U_i \mid \lambda_i)}{p(U_i \mid \lambda_j)}
+ \log \frac{p(U_j \mid \lambda_j)}{p(U_j \mid \lambda_i)}.$$

These approaches are examples of a first order statistical analysis of the output
distributions. Several researchers have examined the issue of measuring “distances” between
HMMs [59] for measuring model similarity. The goal then is to search for the set of cohort
HMMs which are close to the claimant’s HMM in some probabilistic distance. If enough
training sequences from each speaker are evaluated against each HMM, a distribution of
log likelihoods begins to form, from which a sample mean and variance can be extracted [42].

Higher order statistics can be used in conjunction with the Bhattacharyya distance
for measuring the separability between the output distributions of a pair of HMMs [46].
The Bhattacharyya distance is derived from an analysis of determining an upper bound on
the Bayes error rate of a two class problem. The form of this distance, for the 1-dimensional
log likelihoods, is

$$d_B(\lambda_i, \lambda_j) = \frac{1}{4} \frac{(m_i - m_j)^2}{\sigma_i^2 + \sigma_j^2}
+ \frac{1}{2} \log \frac{\sigma_i^2 + \sigma_j^2}{2 \sigma_i \sigma_j},$$

where $m_j$ represents the enrollment likelihood mean and $\sigma_j^2$ represents the enrollment
likelihood variance. The first term is a measure of the class separability due to the difference
in the means, while the second term is a measure of separability due to the variance
difference. Fielding [42] has shown how the use of second order statistics can be useful for HMM
model comparisons. The Bhattacharyya distance applied to the log probability statistics
provides a unique approach to log-likelihood ratio normalization.
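A small sketch of the one-dimensional Bhattacharyya distance between two Gaussian-distributed sets of log likelihoods is given below. It uses the standard form for univariate normal densities, a mean term of (1/4)(m_i - m_j)^2 / (s_i^2 + s_j^2) plus a variance term of (1/2) log[(s_i^2 + s_j^2) / (2 s_i s_j)]. The function name and argument order are illustrative.

```python
import math


def bhattacharyya_1d(mean_i, var_i, mean_j, var_j):
    """Bhattacharyya distance between N(mean_i, var_i) and N(mean_j, var_j)."""
    if var_i <= 0 or var_j <= 0:
        raise ValueError("variances must be positive")
    var_sum = var_i + var_j
    # Separability due to the difference of the means.
    mean_term = 0.25 * (mean_i - mean_j) ** 2 / var_sum
    # Separability due to the difference of the variances.
    var_term = 0.5 * math.log(var_sum / (2.0 * math.sqrt(var_i * var_j)))
    return mean_term + var_term
```

With equal variances the second term vanishes and the distance reduces to a scaled squared difference of means; with equal means only the variance term remains, which is the information the first order measures cannot see.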
5.7  Speaker Verification Results on YOHO

To avoid statistical dependence between phrases within each verification session, all
four combination phrases are taken as a test sample, as outlined in [17]. Results are also
shown when this dependence assumption is not made and all utterances (or pairs) are each
taken as a sample. The standard procedure performs neither inter-gender tests nor tests
with cohort speakers.

5.7.1  Vector Quantization.  Verification using the VQ approach makes use of a
method similar to log likelihood ratios. If each speaker cluster is assumed to be the mean of
a unit-variance normal density, and all frames are independent, then the negative
log-likelihood of a test utterance is proportional to the overall test utterance VQ distortion.
A rank ordering of close speakers or “cohorts” is accomplished by the simple Difference of
Means, the second order Bhattacharyya and Symmetric selection strategies using negative
distortion for the log-likelihoods.
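The link between distortion and log-likelihood can be written out explicitly. The derivation below is a sketch under assumptions stated here rather than taken from the experiments: each of the T frames $x_t$ has dimension d and is scored by a unit-variance Gaussian centred on its nearest codeword $c(x_t)$, and frames are independent.

```latex
\begin{align*}
-\log p(U)
  &= -\sum_{t=1}^{T} \log \left[ (2\pi)^{-d/2}
     \exp\!\left( -\tfrac{1}{2} \lVert x_t - c(x_t) \rVert^2 \right) \right] \\
  &= \tfrac{1}{2} \sum_{t=1}^{T} \lVert x_t - c(x_t) \rVert^2
     + \tfrac{Td}{2} \log (2\pi).
\end{align*}
```

The additive constant depends only on T and d, so it cancels whenever two codebooks are compared on the same utterance. This is why negative distortion can stand in for the log-likelihood in the cohort selection measures.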

These cohorts provide a reference for the verification system. The claimed speaker
model distortion is normalized by the average distortion of his or her closest cohorts. Equal
Error Rates (EER) for each of the three cohort normalization methods are shown in Tables
10 and 11 for codebook sizes of 32 and 64 and cohort sizes of 5 and 10. Note that very little
difference among the cohort selection methods is evident.

Table  10.  Speaker  Verification  Equal  Error  Rates  (%)  using  vector  quantization  overall 
distortion  for  1  and  4  combination  phrases.  Cohort  normalization  methods 
on  the  negative  distortion  include  Difference  of  Means,  Bhattacharyya,  and 
Symmetric  using  5  cohorts. 


Cohort  Normalization 

Males  (Females),  VQ  32 

Males  (Females),  VQ  64 

1 

4 

1 

4 

Difference  Of  Means 

Bhattacharyya 

Symmetric 

2.08(3.44) 

2.15(3.12) 

1.96(2.18) 

3.56(3.76) 

3.68(3.76) 

3.65(3.81) 

1.69(1.25) 

1.60(1.28) 

1.60(1.32) 

Table 11.  Speaker Verification Equal Error Rates (%) using Vector Quantization overall
distortion for 1 and 4 combination phrases. Cohort normalization methods
on the negative distortion include Difference of Means, Bhattacharyya, and
Symmetric using 10 cohorts.


Cohort Normalization

Males (Females), VQ 32

Males (Females), VQ 64

1

4

1

4

Difference of Means

Bhattacharyya

Symmetric

3.96 (4.92)
4.09 (4.86)
4.03 (4.22)

Mm

1.23 (0.96)
1.13 (0.94)
1.13 (0.94)

5.7.2  Phonemic Frame AR Hidden Filters.  While identification results using
frame autoregressive hidden filters performed adequately, the method applied to verification
did not perform as well as VQ. This can be attributed to the closeness of
impostor and true claimant log-likelihood scores. For example, using four combination
lock phrases and five cohorts, the best verification equal error rates were 4.68%, 4.68% and
4.33% for DOM, Bhattacharyya, and Symmetric cohort selection strategies, respectively.
These results were based on the best frame AR hidden filter model in the identification tests:
3-state left-to-right monophones with 12-th order filters.

5.7.3  Vector Hidden Filters.  Tables 12 and 13 summarize the extensive verification
tests undertaken for this research. For each of the features (MFCC+E, MFCC+E+Δ
and MFCC+E+Δ+ΔΔ), a complete set of hidden filter Markov models (0-th order) was
trained for each speaker. This set includes 21 monophones per speaker, with each
monophone model consisting of a 3-state left-to-right two-mixture hidden Markov model, sharing
a single diagonal covariance.

All possible male (female) test utterances were applied to all male (female) models,
respectively. The tables were generated using the approximation to the log-likelihood ratio
$\mathcal{L}$ defined by Equation 60 using a specific ordered set of cohorts C. This ordered set
was previously determined by passing all enrollment data through all the trained models and
sorting by the three cohort similarity measures: $d_{DOM}$, $d_{SYM}$ and $d_B$. The final equal
error rate (EER) is calculated by stepping a global threshold until the difference
between the false accept and false reject error rates converges. Note that the best equal
error rate is found using the full 39-dimensional features with the Bhattacharyya cohort
normalization and prompting 4 combination locks as a single test trial.
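The threshold-stepping procedure for the equal error rate can be sketched as below. It is a simplified illustration, assuming the cohort-normalized scores of true-speaker and impostor trials are available as two lists, that a claim is accepted when its score exceeds the threshold, and that candidate thresholds are taken from the observed scores. The function names are hypothetical.

```python
def error_rates(true_scores, impostor_scores, threshold):
    """Return (false accept rate, false reject rate) at a global threshold."""
    false_rejects = sum(score <= threshold for score in true_scores)
    false_accepts = sum(score > threshold for score in impostor_scores)
    return (false_accepts / len(impostor_scores),
            false_rejects / len(true_scores))


def equal_error_rate(true_scores, impostor_scores):
    """Step a global threshold and return (EER estimate, threshold).

    The EER estimate is the mean of the two error rates at the threshold
    where they are closest to each other.
    """
    best = None
    for threshold in sorted(set(true_scores) | set(impostor_scores)):
        fa, fr = error_rates(true_scores, impostor_scores, threshold)
        gap = abs(fa - fr)
        if best is None or gap < best[0]:
            best = (gap, (fa + fr) / 2.0, threshold)
    return best[1], best[2]
```

On real data the two rates rarely coincide exactly at any observed score, which is why the sketch reports their average at the closest approach.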
The extensiveness of these tests is further clarified. When one combination lock
phrase is tested for verification, the number of potential false reject tests (true speaker
claiming him/herself) is 4240 for males and 1280 for females. The potential false accept
tests (impostors) for the five cohort table is 424,000 for males and 33,280 for females.
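These trial counts can be reproduced with simple arithmetic. The sketch assumes the commonly cited YOHO layout of 106 male and 32 female speakers, each with 10 verification sessions of 4 phrases, and assumes that impostor claims exclude the speaker's own model and the claimed model's cohorts. These figures are assumptions that happen to reproduce the totals above, not details stated in this section.

```python
def trial_counts(num_speakers, sessions, phrases_per_session, num_cohorts):
    """Return (false reject trials, false accept trials) for one gender."""
    true_trials = num_speakers * sessions * phrases_per_session
    # Models that may be claimed falsely: everyone except the true
    # speaker and the cohort speakers used for normalization.
    impostor_models = num_speakers - 1 - num_cohorts
    return true_trials, true_trials * impostor_models


males = trial_counts(106, sessions=10, phrases_per_session=4, num_cohorts=5)
females = trial_counts(32, sessions=10, phrases_per_session=4, num_cohorts=5)
print(males, females)  # prints: (4240, 424000) (1280, 33280)
```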


Table 12.  Speaker Verification Equal Error Rates (%) using 5 cohorts based on 2 mixture
3-state monophones. Base is MFCC+E.

                  DOM, Males (Females)
Feature           1              2              4
Base+Δ+ΔΔ         1.39 (1.89)    0.89 (1.95)    0.66 (0.93)
Base+Δ            1.58 (2.16)    0.99 (1.25)    0.74 (0.71)
Base              2.38 (3.14)    1.50 (2.03)    1.22 (1.25)

                  Bhattacharyya, Males (Females)
Feature           1              2              4
Base+Δ+ΔΔ         1.53 (1.89)    0.89 (0.92)    0.68 (0.63)
Base+Δ            1.70 (1.95)    1.07 (1.09)    0.83 (0.63)
Base              2.57 (2.98)    1.55 (2.03)    1.13 (0.94)

                  Symmetric, Males (Females)
Feature           1              2              4
Base+Δ+ΔΔ         1.37 (1.72)    0.85 (0.78)    0.57 (0.63)
Base+Δ            1.53 (1.79)    0.90 (0.94)    0.566 (0.60)
Base              2.38 (2.82)    1.51 (1.89)    1.03 (1.25)

Figure  22  demonstrates  the  effectiveness  of  the  Bhattacharyya  distance  when  used  in 
conjunction  with  the  log-ratio  normalization  compared  to  other  cohort  selection  methods. 


Table 13.  Speaker Verification Equal Error Rates (%) using 10 cohorts based on 2 mixture
3-state monophones. Base is MFCC+E.

                  DOM normalization, Males (Females)
Feature           1              2              4
Base+Δ+ΔΔ         0.94 (1.41)    0.51 (0.63)    0.38 (0.35)
Base+Δ            1.04 (1.41)    0.56 (0.94)    0.47 (0.31)
Base              1.72 (2.17)    0.95 (1.12)    0.75 (0.66)

                  Bhattacharyya normalization, Males (Females)
Feature           1              2              4
Base+Δ+ΔΔ         0.92 (1.56)    0.47 (0.63)    0.21 (0.55)
Base+Δ            1.01 (1.56)    0.66 (0.63)    0.47 (0.31)
Base              1.79 (2.50)    1.03 (1.41)    0.56 (0.94)

                  Symmetric normalization, Males (Females)
Feature           1              2              4
Base+Δ+ΔΔ         0.97 (1.25)    0.52 (0.51)    0.38 (0.32)
Base+Δ            1.06 (1.39)    0.56 (0.47)    0.47 (0.31)
Base              1.84 (1.95)    1.08 (0.97)    0.68 (0.56)

5.8  Critical Error Analysis

Higgins [51], and more recently Campbell [17], have examined the statistical
significance of the YOHO experiments based on confidence intervals. This section presents an
alternative method using hypothesis test analysis at the highest significance levels for the
amount of YOHO data. This presentation using significance levels provides a
straightforward approach to accepting a potential speaker verification system. The technique
easily generalizes to any pattern recognition problem where a target level of acceptability is
provided. We place much greater emphasis on rejecting potentially unacceptable systems
than on accepting potentially acceptable ones. For the speaker verification problem, the
consequences of a wrong decision dictate this approach.

Define the null hypothesis, $H_0$, to be that the System Error Rate, $S_{er}$, does not meet the
Target Error Rate, $T_{er}$:

$$H_0 : S_{er} > T_{er} \quad \text{(UNACCEPTABLE)}$$
$$H_1 : S_{er} < T_{er} \quad \text{(ACCEPTABLE)} \qquad (61)$$


Previously, results have been reported at the 75% confidence level for False Acceptance
and False Reject target values. However, this method would pass a large percentage of
systems that are in reality unacceptable.

The main concern should not be the probability of meeting the Target Error Rate,
which a confidence level analysis provides; the main concern should be the decision
to reject potential candidates, taking into account the consequences of a wrong decision.
Conjecture that all systems are unacceptable and allow the experimental evidence (observed
errors) to reject this conjecture [89]. One can also examine the probability of failing
acceptable systems, but this is a secondary concern.

Figure 22.  Speaker Verification False Accept and False Reject Error Rates (%) using
DOM, Bhattacharyya and Symmetric cohort selection strategies. Results
show the effect of an unseen threshold. Data only used male speakers, when
prompted with 4 combination lock phrases and normalized with 10 cohorts
using full 39-dimensional features (MFCC+E+Δ+ΔΔ). Best Equal Error
Rate shown is 0.21% using the Bhattacharyya normalization. Note: (*)
denotes U.S. Government requirement of 1% FR and 0.1% FA [18, 17].

5.8.1  Statistical  Assumptions.  Many  times,  we  perform  a  set  of  tests  and  report 
average  results,  typically  with  confidence  intervals.  Tests  can  average  over  several  random 
initial  experiment  setups  (Monte  Carlo  Confidence  Interval)  or  averages  can  be  based  on 
the  number  of  total  test  observations  (Classifier  Confidence  Interval)  [111].  However,  if  a 
target  error  rate  is  specified,  then  instead  of  bounding  the  results,  one  needs  to  specify 
how  confident  we  are  of  meeting  or  exceeding  this  specification. 

Ruck [111] reviews the approach and the procedures for both Monte Carlo and
Classifier confidence intervals. Each independent recognition trial is a Bernoulli random
variable, taking values 0 and 1 if the verification or identification was correct or incorrect,
respectively. From elementary probability, the sum of Bernoulli random variables takes on a
binomial distribution; thus the total number correct (or incorrect) is a binomial random
variable. Under certain conditions, a Poisson or normal random variable may be used to
approximate the binomial and easily specify confidence intervals.

Let X be the total number of errors, a random variable. Given n independent tests
with a probability of error p, the binomial distribution is

$$P(X \le x; n, p) = \sum_{i=0}^{x} \binom{n}{i} p^{i} (1 - p)^{n - i},$$

which is the probability of observing at most x total errors. Suppose we observe x errors on
the n tests performed. Our point estimate for p is x/n. However, a better method of specifying
the true error probability p is to bound it at the $\gamma$ = 95% or 99% confidence interval. The
boundary values (random variables) we seek are $p_L$ and $p_U$ such that

$$P(p_L < p < p_U) = \gamma.$$

Since n is exceedingly large for YOHO experiments, an approximation to the binomial
proves efficient. It has been noted that X is approximately normal when n is large, with
mean np and variance npq, q = 1 - p. Hoel [52] provides some experimental insight, in
that this approximation is valid when np > 5, p < .5, nq > 5 and q > .5. Small values of p
with “moderately” large n would skew the distribution, and thus the following summary
holds in practice:

n small → Use Binomial,
n large, p large/small → Use Poisson,
n large, p moderate → Use Normal.
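To make the rule of thumb concrete, the sketch below compares the exact cumulative binomial probability with its Poisson and normal approximations. It is an illustration only: the helper names are hypothetical, the binomial terms are summed in log space so that large n does not overflow, and the normal approximation is evaluated without a continuity correction.

```python
import math


def binomial_cdf(x, n, p):
    """P(X <= x) for X ~ Binomial(n, p), with 0 < p < 1 and 0 <= x <= n."""
    total = 0.0
    for i in range(x + 1):
        log_term = (math.lgamma(n + 1) - math.lgamma(i + 1)
                    - math.lgamma(n - i + 1)
                    + i * math.log(p) + (n - i) * math.log1p(-p))
        total += math.exp(log_term)
    return total


def poisson_cdf(x, lam):
    """P(X <= x) for X ~ Poisson(lam), lam > 0."""
    term = math.exp(-lam)
    total = term
    for k in range(1, x + 1):
        term *= lam / k
        total += term
    return total


def normal_cdf(x, mean, variance):
    """P(X <= x) for X ~ Normal(mean, variance)."""
    return 0.5 * (1.0 + math.erf((x - mean) / math.sqrt(2.0 * variance)))


# n large, p small: the Poisson approximation tracks the binomial closely.
n, p = 4240, 0.001
exact = binomial_cdf(2, n, p)
approx_poisson = poisson_cdf(2, n * p)
approx_normal = normal_cdf(2, n * p, n * p * (1 - p))
```

For this large n and small p, the Poisson value stays within a few thousandths of the exact binomial value, while the symmetric normal curve is a poorer fit to the skewed distribution.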

Using the Poisson approximation to the binomial (which is good for error rates less
than 5% and number of trials greater than 100), critical error curves [51] are drawn based
on the hypothesis test formulation in Equation 61 using the most stringent significance
levels (Figure 23). Following Higgins [51],
Definition V.1 (Critical Error) The Critical Error is the maximum number of errors that can
be observed before rejecting the recognition system at a given significance level.


The probabilities of accepting a system for various critical errors are given in Figure 24.
Using these graphs allows recalculation of critical errors for YOHO in Table 14 and Table
15.


Figure 23.  Critical Errors for Tests Designed at the 5% and 25% significance levels. The
curve is generated by searching for the appropriate λ value given a particular (discrete)
Critical Error. The curve is used by knowing λ = N · Ter, the Target Error Rate
(Ter) and the size of the database N, then reading over and down.


We  review  the  creation  of  these  graphs,  since  their  full  understanding  can  lead  to 
applications  elsewhere.  For  our  application,  these  graphs  provide  the  maximum  number 
of  critical  errors  able  to  be  seen  while  still  satisfying  the  target  error  rate.  However,  one 
could  use  this  analysis  for  sizing  a  particular  database  by  pre-specifying  the  critical  errors. 
The Poisson distribution has been chosen in our case since: 1) n can be up to 4240 for identification and over 110,000 for verification and 2) the error rates are specified at 1% and 0.1%:

Poisson distribution:

\[
p(x; \lambda) = \sum_{k=0}^{x} \frac{e^{-\lambda}\,\lambda^{k}}{k!}
\]

As can be seen, only the single parameter λ specifies this distribution and subsequently its mean and variance.

Figure 24. Probability of Rejecting H0 (Accepting that the System Meets the Target Error Rate, Ter) for number of critical errors (0, 2, 4, 6, 8, 10, 12, and 14) at the 5% significance level.
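The Poisson model above can be turned into a small calculation. The following is a minimal sketch, assuming the critical error is the largest count c for which P(X <= c; λ = N · Ter) does not exceed the significance level; the function names are illustrative and are not taken from this research. The second routine mirrors the search described for Figure 23, finding λ for a given discrete critical error, from which a database size N = λ / Ter follows.

```python
import math


def poisson_cdf(x, lam):
    """P(X <= x) for X ~ Poisson(lam), summed term by term."""
    if lam > 700.0:
        # exp(-lam) would underflow; this sketch only covers moderate lam.
        raise ValueError("lam too large for this simple summation")
    if x < 0:
        return 0.0
    term = math.exp(-lam)  # k = 0 term
    total = term
    for k in range(1, x + 1):
        term *= lam / k
        total += term
    return min(total, 1.0)


def critical_error(n_tests, target_rate, significance):
    """Largest c with P(X <= c) <= significance at lam = N * Ter.

    Returns -1 when even zero observed errors would not pass the test.
    Assumes 0 < significance < 1.
    """
    lam = n_tests * target_rate
    c = -1
    while poisson_cdf(c + 1, lam) <= significance:
        c += 1
    return c


def lambda_for_critical_error(c, significance, tol=1e-10):
    """Bisection for the lam at which P(X <= c; lam) equals the significance."""
    lo, hi = 0.0, float(c + 1)
    while poisson_cdf(c, hi) > significance:  # widen the bracket
        hi *= 2.0
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if poisson_cdf(c, mid) > significance:
            lo = mid  # cdf still too large, lam must grow
        else:
            hi = mid
    return 0.5 * (lo + hi)


# Illustrative inputs: 1000 tests at a 1% target rate, 5% significance.
print(critical_error(1000, 0.01, 0.05))

# Sizing a database for zero allowed errors at 25% significance, 0.1% target.
lam = lambda_for_critical_error(0, 0.25)
print(lam, lam / 0.001)
```

For a critical error of zero at the 25% level the search returns λ = ln 4, which at a 0.1% target rate corresponds to roughly 1,386 tests.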


5.8.2 Application of Hypothesis Test. Table 14 provides critical errors (CE) for particular FA and FR target error rates. Since the number of false reject tests is limited (4,240 for males and 1,280 for females), one cannot report results at the 5% significance level, and the entries for False Accept are provided at the 25% significance level [17]. Also, we chose to use all impostor tests available, counting each session as statistically independent. This amounts to total false acceptance (FA) tests of 106,000 and 100,700 for 5 and 10 cohorts, respectively. The rationale for this decision is based on allowing more than one session for false reject testing and counting those as independent.
Table 14. Critical Errors (CE) at the 5% and 25% Significance level (Sigf) for the U.S. Government Required and Goal Target Error Rates (Target). Shown separately for False Accept (FA - based on 5 cohorts) and False Reject (FR) tests. Table 14 provides approximate readings from Figure 24, where the ratio of hypothesized Ser/Ter = e and the Probability of Accepting the System is denoted Ppass. Sizes are attainable with the YOHO database.

Test   Target   Sigf   Ppass   e     Size      CE
FR                     70%     2/3   1,080     8
FR     0.1%     25%            1/2   1,386     0
FA                     99%     2/3   105,065   88
FA     0.01%    5%     57%     1/2   105,131
FA                             2/3   105,517   98
FA     0.01%    25%    88%     1/2   96,845    7

Table 15. Critical Errors (CE) at the 5% Significance level for the U.S. Government Required and Goal Target Error Rates (Target). Shown for False Accept (FA - based on 10 cohorts). All other columns described in Table 14.

Test   Target   Sigf   Ppass   e     Size      CE
FA                     98%     2/3   100,700   84
FA     0.01%    5%     99%     1/2   91,535    4
FA                             2/3   100,345   93
FA                             1/2   96,845    7

For example, in order to pass a system at the 5% significance level with a Target of 0.1% False Acceptance Rate, one must achieve less than or equal to 88 errors in 105,065 impostor tests. In addition, if we think our system is twice as good as the target, e = 1/2, then we have a 99% probability of accepting the system. An interesting conclusion to this analysis is demonstrated through Figure 25, which is similar to Figure 22 except that actual error counts, rather than percent errors, are plotted as an unseen threshold is varied. The markers denote the FA and FR critical errors at the 5% and 25% significance levels. While Figure 22 appears to indicate that all three cohort normalization methods passed within the specified U.S. Government target, in actuality, when the hypothesis test (Equation 62) is used with a specified significance level, only one method would actually be accepted.
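The probability of accepting a system (Ppass) can be checked numerically under the same Poisson model. This is a hedged sketch, assuming Ppass is the probability of observing at most CE errors when the true error rate is e times the target rate; `prob_pass` is an illustrative name. The False Accept figures are those quoted in the text (88 errors in 105,065 tests at a 0.1% target); the False Reject figures (8 errors in 1,080 tests, e = 2/3) are read from Table 14, with the 1% target assumed.

```python
import math


def poisson_cdf(x, lam):
    """P(X <= x) for X ~ Poisson(lam), summed term by term (moderate lam only)."""
    term = math.exp(-lam)
    total = term
    for k in range(1, x + 1):
        term *= lam / k
        total += term
    return min(total, 1.0)


def prob_pass(critical_errors, n_tests, target_rate, epsilon):
    """Chance of seeing at most `critical_errors` errors when the true
    error rate is epsilon * target_rate (epsilon = hypothesized Ser/Ter)."""
    return poisson_cdf(critical_errors, n_tests * target_rate * epsilon)


# False Accept: system assumed twice as good as the 0.1% target.
fa_pass = prob_pass(88, 105065, 0.001, 0.5)

# False Reject: assumed 1% target, system at two thirds of the target rate.
fr_pass = prob_pass(8, 1080, 0.01, 2.0 / 3.0)

print(round(fa_pass, 4), round(fr_pass, 4))
```

Under these assumptions the False Accept case passes with probability above 99%, and the False Reject case with probability near 70%, in line with the Ppass entries of Table 14.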
Due to the great imbalance of impostor tests, we can make a much stronger statement by mixing the significance levels as follows.

The speaker dependent phoneme system, based on cohort selection using the symmetric score, passes the 0.1% False Acceptance target rate at the 5% significance level, while passing the 1% False Reject target rate at the 25% significance level.


Figure 25. Speaker Verification False Accept and False Reject Error (#) using DOM, Bhattacharyya and Symmetric cohort selection strategies. Results show the effect of an unseen threshold. Data used only male speakers, prompted with 4 combination lock phrases and normalized with 10 cohorts using full 39-dimensional features (MFCC+E+Δ+ΔΔ).


5.9 Conclusion

This chapter demonstrated several new findings concerning the ability to model and subsequently identify or verify speakers based on the acoustic signal. First, it was demonstrated that vector quantization, a reliable and proven method, provides similar performance to the Poritz frame autoregressive model. Whereas both have about the same number of parameters, the hidden filter approach can also hypothesize the word string spoken. Though the vector hidden filter method using the baseline Mel frequency cepstral representation showed better modeling with increased filter orders (Appendix 5.B), it did not improve classification. One plausible explanation relates to the strict stationarity of the cepstral coefficients as hypothesized in Appendix 5.C. This manifests as a trivial filter across all phonemes and all speakers. Another explanation may relate to the dilemma of classifying signals based on prediction: often, the better the prediction of training data, the less generalization occurs during test.

By using a 0-th order filter, which models states as noisy constant functions, statistically significant improvements were demonstrated over vector quantization. The addition of transitional coefficients and the addition of several prompted phrases monotonically decrease errors. A shared covariance used across the phoneme states also decreased errors for identification, probably due to the limited enrollment data. Best results of 100% identification on both male and female speakers^ were demonstrated using two-mixture 0-th order filters.

Log ratios and log ratio normalization using cohorts were introduced. Three methods of selecting close speakers were examined. The Bhattacharyya distance, a new approach developed within this research which includes second order statistics, was shown to be the optimal selection scheme when equal error rate (EER) is the benchmark. A more significant critical error analysis was developed to specify the maximum errors allowed while still achieving a specified target error rate. This analysis was applied to the YOHO database. The noteworthy conclusion of this section is the ability to easily make a claim of meeting requirements based on the maximum number of errors seen during testing.

Comparison to Campbell’s synopsis [18] of recent known results demonstrates the effectiveness of these approaches (see Table 16). These tests were designed to model context and coarticulation at the subword (phoneme) level for speaker modeling. Historical insight dictated that both the physiology of a speaker and the neural habits and patterns together

^100%  based  on  removing  the  false  voice  session  of  speaker  240. 
Table 16. Recent LDC YOHO Database Results [17].

Reference       Verification EER (%)    Identification Error (%)
ITT NN          0.5
ITT CSR         1.7
MIT/LL GMM      0.51 (0.2m, 1.5f)       0.8 (0.3m, 2.2f)
Rutgers’ NTN    0.65
Rutgers’ HMM    1.36
Rutgers’ LVQ    0.36
COLOMBI         0.21m, 0.31f            0.0m, 0.31f

differentiate speakers. Both of these unique traits alter the dynamics of the acoustic signal, which we have successfully modeled with a hidden Markov architecture.


VI.  Recommendations  and  Conclusions 


6.1  Recommendations 

During  the  course  of  good  research,  several  avenues  often  arise  which  are  not  taken. 

Two  roads  diverged  in  a  wood,  and  I  - 
I  took  the  one  less  traveled  by. 

And  that  has  made  all  the  difference. 

Robert  Frost,  1915 

This  section  addresses  those  areas  which  will  have  great  potential  for  robust  dynamic 
time  series  modeling  or  applications  in  speaker  recognition.  We  recommend  the  following 
research  areas,  in  order  of  importance: 

Speaker Verification Normalization. Normalization of the likelihood scores provides orders of magnitude improvement over fixed thresholds. The fundamental reason normalization is required is that the overall likelihood score contains much more than speaker contributions. As evident from Figure 21, the variability in the log-likelihood scores reflects word sequence and ordering, phonemic content and other language phenomena - all unrelated to speaker verification. Basic research in removing language and grammar effects, which are present in current speaker models, would be significant to future systems concerning speaker authentication and speaker adaptation.

Noncausal Filters. It has been observed that the human body appears to be a multichannel, noncausal processing machine [109]. Multichannel refers to the several human sensors, all coherently merged in our billions of neurons to form a single consistent “world model”. Various noncausal capabilities have been observed in our perception of time and sound (see the auditory illusion of phoneme restoration by Warren and Warren [127, 128]). This research has focused exclusively on causal filters. In the signal processing and modeling literature there has been recent interest in Two Sided Prediction (TSP) and other noncausal approaches. These models should be examined for classification, as they may provide better forward and backward context, potentially within a hidden Markov model.




Prediction and Classification. The theoretic relation between the ability to accurately classify sequences and the ability to predict them needs to be addressed. Under certain assumptions, Levin has shown that, for nonlinear Markov-modulated dynamic systems, there exists a direct relation between reducing mean squared error in the training set and the overall likelihood of the training set. However, an investigation should be conducted relating optimal prediction to optimal classification.

Discriminant Models. The theory behind Maximal Mutual Information (MMI) [16] does not seek the maximum likelihood estimates of parameters, as Baum-Welch does. Instead, since the correct source models will never be known, training methods should be discriminative and speaker models should be trained in conjunction with all others for optimal discrimination. While this may be optimal in terms of requiring the fewest assumptions, the amount of training data and the difficulty of adding new classes make this technique inefficient. Methods should be researched which balance discrimination against the ability to add new classes (speakers) and to estimate parameters efficiently. Naturally, discriminative hidden filter models should also be investigated.

Another related research area lies in the effective use of artificial neural network technologies. As presented in the framework of general hidden filters (Chapter II), nonlinear and discriminative techniques have recently been examined for output density estimation within hidden Markov models. However, they have often involved large “stupid” neural networks and required specialized, fast hardware for training. Continued work in new neural architectures and their placement in the hidden Markov architecture should be performed.

6.2  Contributions 

A  number  of  original  research  contributions  have  been  provided. 

Generalized Hidden Filter Architecture. A complete framework for many existing linear and nonlinear systems used for classification, as well as prediction, was developed for discrete state Markov models. The existing hidden Markov model independence assumptions were reviewed and removed, which defined a new, more generalized, hidden filter Markov model.

AR and ARMA Hidden Filters. New reestimation methods are provided for autoregressive (AR) and autoregressive moving average (ARMA) hidden filters, as well as an optimal initialization strategy. The ability to reestimate these filters adequately for the difficult ergodic case is novel and shown by example. The new ARMA Markov modulated hidden filters are applicable to specific broad classes of phonemes with a spectral zero component. An extension to frame autoregressive hidden filters was proposed for accurate phoneme modeling and applied to speaker recognition.

Vector Autoregressive Hidden Filters. The extension from sample or frame based filters to full vector autoregressive hidden filters was developed using an emit-on-state notation. Full and diagonal regression variations were developed. The choice of spectral features used in this research, the Mel frequency cepstral coefficients, dictated a diagonal predictor and noise model. A procedure of a posteriori mean removal was developed to separate the state mean estimation from the filter coefficients for numerical stability.


HMM and Hidden Filter Convergence. A new proof of monotonic convergence for Gaussian mixtures was presented using an equivalence model paradigm. A new proof of monotonic convergence for hidden filter Markov models was then demonstrated. The Markov property of the observations for hidden filter models was applied to the Fielding [42] information theoretic proof. Since pattern recognition methods seek ways which reduce entropy (and reduce classification errors), this proof justified the hidden filter model over standard hidden Markov models.

Phonetic Modeling for Speaker Recognition. A speaker dependent phoneme-based hidden Markov model system was developed for both speaker identification and verification using the extensive YOHO database. State-of-the-art speech recognition tools were incorporated into the system, such as phonetic labeling, word dictionaries, bi-word language models and Viterbi scoring constraints. The left-to-right 3-state phoneme models were analyzed exclusively. The use of ergodic structures was hypothesized and demonstrated as modeling language effects and thus dictated the choice of left-to-right monophone models. The method of forced Viterbi decoding of phoneme based temporal models for speaker verification was shown optimal, with a theoretic explanation demonstrated with language entropy. A novel approach to selecting the correct hidden filter model order using penalty functions was reported, indicating a monophone dependent filter order. However, the optimal hidden filter used for all tests happened to be the 0-th order hidden Markov model - a hypothesis concerning the strict sense stationarity of speech was offered to explain this effect. A new second order metric for cohort selection was developed and shown to provide the best equal error rate of 0.21% on YOHO males and 0.31% on females. A critical error analysis is provided for YOHO using a hypothesis test technique which demonstrated the importance of comparing results to a test statistic.

6.3  Conclusions 

A complete system framework for hidden filter Markov models has been developed and applied to the speaker recognition problem. This research proposed theoretical extensions to a class of stochastic models and demonstrated their effectiveness on the problem of text-independent (constrained) speaker recognition. Analysis concerning multiple mixtures and hidden filter models guarantees monotonically increasing likelihoods during learning. Using information theory, the hidden filter Markov models were demonstrated optimal over hidden Markov models for pattern recognition problems. Both closed set identification and normalized likelihood ratio verification using cohorts were performed on the extensive YOHO database. Perfect identification for males and females was possible when prompting four combination lock phrases. Equal error rates of 0.21% for males and 0.31% for females were accomplished using forced Viterbi scoring and cohort normalization incorporating a newly developed Bhattacharyya distance metric. Where other researchers report equal error rates, this research demonstrated the importance of a critical error analysis, basing acceptance on the number of critical errors found using a hypothesis test technique. We feel this document advances the state-of-the-art in the areas of Markov modulated dynamic systems and their properties, log-likelihood normalization, and speaker authentication/verification techniques.

Many  new  applications  will  require  speech-based  biometric  recognition  such  as  secure 
access  control,  telephone-based  recognition,  transaction  and  credit  account  verification, 
forensic  science,  law  enforcement  and  military  intelligence  gathering.  Successful  methods, 
such  as  demonstrated  by  this  research,  provide  excellent  results  of  only  2  errors  in  1000 
attempts.  The  theoretic  contributions  clearly  demonstrate  the  efficiency  of  training  these 
models  and  their  justifiable  use  over  existing  techniques  for  many  pattern  recognition 
problems,  beyond  speaker  recognition.  Insights  from  world-class  researchers,  suggesting 
the  importance  of  dynamic  modeling  of  phonemes,  have  contributed  to  make  this  research 
state-of-the-art  in  the  challenging  field  of  speaker  recognition. 
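Appendix A. Induction Derivation of the Forward-Backward Variables

The following provides the inductive calculation of the forward and backward variables. Initial condition:

\[
\begin{aligned}
\alpha_1(i) &= p(O_1, q_1 = i \mid \lambda) \\
            &= p(O_1 \mid q_1 = i, \lambda)\, p(q_1 = i \mid \lambda) = b_i(O_1)\,\pi_i
\end{aligned}
\]

which is valid for $1 \le i \le N$. Given $\alpha_t$, now find $\alpha_{t+1}$:

\[
\begin{aligned}
\alpha_{t+1}(j) &= p(O_1 \cdots O_{t+1}, q_{t+1} = j \mid \lambda) \\
                &= p(O_{t+1} \mid O_1, \ldots, O_t, q_{t+1} = j, \lambda)\, p(O_1, \ldots, O_t, q_{t+1} = j \mid \lambda) \\
                &= b_j(O_{t+1})\, p(O_1, \ldots, O_t, q_{t+1} = j \mid \lambda)
\end{aligned}
\]

Expanding,

\[
\begin{aligned}
p(O_1, \ldots, O_t, q_{t+1} = j \mid \lambda) &= \sum_{i=1}^{N} p(O_1, \ldots, O_t, q_{t+1} = j, q_t = i \mid \lambda) \\
 &= \sum_{i=1}^{N} p(q_{t+1} = j \mid O_1, \ldots, O_t, q_t = i, \lambda)\, p(O_1, \ldots, O_t, q_t = i \mid \lambda) \\
 &= \sum_{i=1}^{N} a_{ij}\, \alpha_t(i)
\end{aligned}
\]

Hence,

\[
\alpha_{t+1}(j) = b_j(O_{t+1}) \sum_{i=1}^{N} a_{ij}\, \alpha_t(i)
\]

and this is valid for $1 \le t \le T - 1$ and $1 \le j \le N$. For $t = T$ the total probability is given as

\[
\begin{aligned}
p(O_1 \cdots O_T \mid \lambda) &= \sum_{i=1}^{N} p(O_1 \cdots O_T, q_T = i \mid \lambda) \\
                               &= \sum_{i=1}^{N} \alpha_T(i)
\end{aligned}
\]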
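For the backward variable, let the initial condition be

\[
\beta_T(i) = p(O_{T+1} \cdots O_T \mid q_T = i, \lambda) = 1
\]

for $1 \le i \le N$. Given $\beta_{t+1}$, now find $\beta_t$:

\[
\begin{aligned}
\beta_t(i) &= p(O_{t+1} \cdots O_T \mid q_t = i, \lambda) \\
 &= \sum_{j=1}^{N} p(O_{t+1} \cdots O_T, q_{t+1} = j \mid q_t = i, \lambda) \\
 &= \sum_{j} p(O_{t+1} \cdots O_T \mid q_{t+1} = j, q_t = i, \lambda)\, p(q_{t+1} = j \mid q_t = i, \lambda) \\
 &= \sum_{j} a_{ij}\, p(O_{t+1} \mid O_{t+2} \cdots O_T, q_{t+1} = j, q_t = i, \lambda)\, p(O_{t+2} \cdots O_T \mid q_{t+1} = j, q_t = i, \lambda) \\
 &= \sum_{j} a_{ij}\, b_j(O_{t+1})\, p(O_{t+2} \cdots O_T \mid q_{t+1} = j, q_t = i, \lambda)
\end{aligned}
\]

Note $O_{t+2} \cdots O_T$ is independent of $q_t = i$ by the Markov property, so

\[
\begin{aligned}
\beta_t(i) &= \sum_{j} a_{ij}\, b_j(O_{t+1})\, p(O_{t+2} \cdots O_T \mid q_{t+1} = j, \lambda) \\
           &= \sum_{j} a_{ij}\, b_j(O_{t+1})\, \beta_{t+1}(j)
\end{aligned}
\]

Hence,

\[
\beta_t(i) = \sum_{j=1}^{N} a_{ij}\, b_j(O_{t+1})\, \beta_{t+1}(j)
\]

for $t = T - 1, T - 2, \ldots, 1$ and $1 \le i \le N$. For $t = 1$ the total probability is calculated as:

\[
\begin{aligned}
p(O_1 \cdots O_T \mid \lambda) &= \sum_{i=1}^{N} p(O_1 \cdots O_T, q_1 = i \mid \lambda) \\
 &= \sum_{i} p(O_1 \cdots O_T \mid q_1 = i, \lambda)\, p(q_1 = i \mid \lambda) \\
 &= \sum_{i} \pi_i\, p(O_1 \mid O_2 \cdots O_T, q_1 = i, \lambda)\, p(O_2 \cdots O_T \mid q_1 = i, \lambda) \\
 &= \sum_{i=1}^{N} \pi_i\, b_i(O_1)\, \beta_1(i)
\end{aligned}
\]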
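The two recursions of this appendix can be checked against each other numerically. The following is a minimal sketch, assuming a discrete-observation hidden Markov model with made-up parameters (the systems in this research use continuous output densities, so `b[i][o]` here merely stands in for $b_i(O_t)$). Both the forward and the backward pass should yield the same total probability.

```python
def forward(pi, a, b, obs):
    """alpha[t][i] = p(O_1..O_t, q_t = i | model), with t stored 0-based."""
    n = len(pi)
    alpha = [[pi[i] * b[i][obs[0]] for i in range(n)]]
    for t in range(1, len(obs)):
        prev = alpha[-1]
        alpha.append([
            b[j][obs[t]] * sum(a[i][j] * prev[i] for i in range(n))
            for j in range(n)
        ])
    return alpha


def backward(pi, a, b, obs):
    """beta[t][i] = p(O_{t+1}..O_T | q_t = i, model), with beta at T set to 1."""
    n = len(pi)
    steps = len(obs)
    beta = [[1.0] * n for _ in range(steps)]
    for t in range(steps - 2, -1, -1):
        for i in range(n):
            beta[t][i] = sum(
                a[i][j] * b[j][obs[t + 1]] * beta[t + 1][j] for j in range(n)
            )
    return beta


# Hypothetical 2-state model with 2 observation symbols.
pi = [0.6, 0.4]                      # initial state probabilities
a = [[0.7, 0.3], [0.4, 0.6]]         # a[i][j] = p(q_{t+1} = j | q_t = i)
b = [[0.5, 0.5], [0.1, 0.9]]         # b[i][o] = p(O_t = o | q_t = i)
obs = [0, 1, 1, 0]

alpha = forward(pi, a, b, obs)
beta = backward(pi, a, b, obs)

p_forward = sum(alpha[-1])
p_backward = sum(pi[i] * b[i][obs[0]] * beta[0][i] for i in range(len(pi)))
print(p_forward, p_backward)
```

For long sequences these products underflow, so practical implementations scale the variables or work with logarithms; that refinement is omitted here.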
Appendix B. Phonetic Listing With Examples

Table 17 provides the list of phonemes with examples. Each phoneme is represented by a 3-state left-to-right hidden filter Markov model trained separately for each speaker. Table 18 provides the TIMIT language grammar used in all experiments and two additional grammars - a grammar from the Resource Management (RM) database [134] and an optional mixture of RM and TIMIT. These grammars provide dictionaries covering clearly read text, conversational speech or some combination of the two. In initial experiments, recognition results were not significantly different, so the TIMIT grammar was used for the remaining experiments.


Table 17. Partial Phonetic List from Parsons [90] Applied to Digits.

Arpabet   Example   Digits     Arpabet   Example   Digits
AH        bud       one        W         wow       one
AX        ahead     seven      N         noon      nine
AO        hawed     four       T         tug       two
AY        hide      nine       K         kick      six
EH        head      seven      TH        thick     three
EY        hayed     eight      F         fife      four
ER        heard     thirty     S         cease     six
IH        hid       six        R         roar      four
IY        heed      three      V         verve     seven
UW        who’d     two        (DX)      batter    forty
Table 18. YOHO word grammar. [ ] denotes optional monophone, and | denotes dual path through the word grammar. These grammars are used to expand a transcription into the network of subword models for forced Viterbi decoding or to provide standard Viterbi syntax for automatic speech recognition.

Word      YOHO Monophone Grammar        Source
one       W AH N [sp]                   TIMIT, RM
two       T UW [sp]                     TIMIT, RM
three     TH R IY [sp]                  TIMIT
four      F AO R [sp]                   TIMIT
five      F AY V [sp]                   TIMIT
six       S IH K S [sp]                 TIMIT
seven     S EH V AX N [sp]              TIMIT
nine      N AY N [sp]                   TIMIT
twenty    T W EH N T IY [sp]            TIMIT
          T W EH N IY [sp]              RM
          T W EH N [T] IY [sp]          OPTION
thirty    TH ER T IY [sp]               TIMIT
          TH ER DX IY [sp]              RM
          TH ER DX|T IY [sp]            OPTION
forty     F AO R T IY [sp]              TIMIT
          F AO R DX IY [sp]             RM
          F AO R DX|T IY [sp]           OPTION
fifty     F IH F T IY [sp]              TIMIT, RM
sixty     S IH K S T IY [sp]            TIMIT, RM
seventy   S EH V AX N [T] IY [sp]       TIMIT
          S EH V AX N T IY [sp]         RM
          S EH V AX N [T] IY [sp]       OPTION
eighty    EY T IY [sp]                  TIMIT
          EY DX IY [sp]                 RM
          EY DX|T IY [sp]               OPTION
ninety    N AY N T IY [sp]              TIMIT
          N AY N IY [sp]                RM
          N AY N [T] IY [sp]            OPTION
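The bracket and bar notation of Table 18 can be expanded mechanically into the alternative monophone paths that a decoding network would contain. The sketch below is illustrative only: it assumes whitespace-separated tokens, with an optional monophone written as `[X]` and a dual path written as `A|B` with no internal spaces, and the function name `expand` is hypothetical rather than part of any toolkit used in this research.

```python
from itertools import product


def expand(pronunciation):
    """List every monophone sequence permitted by one grammar entry."""
    choices = []
    for token in pronunciation.split():
        if token.startswith("[") and token.endswith("]"):
            # Optional monophone: either present or skipped (None).
            choices.append([token[1:-1], None])
        elif "|" in token:
            # Dual path: exactly one of the alternatives is taken.
            choices.append(token.split("|"))
        else:
            choices.append([token])
    return [
        [phone for phone in path if phone is not None]
        for path in product(*choices)
    ]


# The OPTION entry for "thirty" from Table 18.
paths = expand("TH ER DX|T IY [sp]")
for path in paths:
    print(" ".join(path))
```

Each optional monophone or dual path doubles the number of sequences, which is one way to see how looser grammars enlarge the set of paths the decoder may choose from.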
Appendix C. Penalty Functions for Order Identification

For autoregressive models, both unidimensional and vector processes, several penalty function methods exist for determining proper model order. Several of these base their optimality criterion on some combination of the error variance and free parameters, yet are derived from the Kullback-Leibler distance between a model PDF and the true PDF of the data. Since it can be shown that error variance will monotonically decrease with model order, a penalty term is added to prohibit excessively large order models. This is the concept behind parsimonious models - ones with as few parameters as possible. Several methods include the Akaike Information Criterion (AIC), the Final Prediction Error (FPE), Parzen’s Criterion of AR Transfer (CAT) function and the Bayesian Information Criterion (BIC) [20, 21, 63, 94, 54]. The AIC is often chosen for small data samples and both the AIC and FPE converge to the same solution as the number of samples increases [63]. The extension of these penalty functions has not been explored for Markov models, yet will be needed for further investigations of their usefulness. The Akaike Information Criterion is defined as

\[
\mathrm{AIC}(p) = T \log \hat{\sigma}_p^2 + 2p
\]

where $p$ is the model order, $\hat{\sigma}_p^2$ is the MLE of the noise variance and $T$ is the total number of observations. This penalty function has been extended to multidimensional vector autoregressive processes [21] and its properties continually evaluated [21, 126]. For a d-dimensional vector process the AIC is further defined as

\[
\mathrm{AIC}(p) = T \log |\Sigma_p| + 2 d^2 p
\]

where similarly, $\Sigma_p$ is the MLE covariance for a Vector AR(p) process. Given an N-state hidden filter Markov model, these penalty functions could be extended to sum the AIC associated with each state. This would imply the following functional forms:

• Full Predictor, Full Covariance: $\mathrm{AIC}(p, N) = \sum_{i=1}^{N} \left[ T \log |\Sigma_p(i)| + 2 d^2 p \right]$

• Diag Predictor, Full Covariance: $\mathrm{AIC}(p, N) = \sum_{i=1}^{N} \left[ T \log |\Sigma_p(i)| + 2 d p \right]$

• Full Predictor, Diag Covariance: $\mathrm{AIC}(p, N) = \sum_{i=1}^{N} \left[ T \log \mathrm{trace}(\Sigma_p(i)) + 2 d^2 p \right]$

• Diag Predictor, Diag Covariance: $\mathrm{AIC}(p, N) = \sum_{i=1}^{N} \left[ T \log \mathrm{trace}(\Sigma_p(i)) + 2 d p \right]$.

For the simple case of single state hidden filter phoneme models, the original versions apply directly, as demonstrated in Figure 26.


Figure 26. Akaike Information Criterion (AIC) for Vector Phoneme Models By Examining the Diagonal Covariance at Several Order Models. The subplots show each of the 21 monophone models’ MLE of the average noise variance (dashed line). Note this MLE decreases with increasing model order and the AIC (solid line) acts accordingly, decreasing to a minimum, then increasing.


This presentation of model order selection for hidden filters is the first known treatment using a statistical penalty function methodology. Though the interaction between the optimal model order based on residual variance and the model order for best recognition is unclear, this appendix does suggest that different phonemes should have varying order models.
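The scalar criterion AIC(p) = T log(noise variance) + 2p can be illustrated on a process whose true order is known. The sketch below is an assumption-laden example, not the procedure used for Figure 26: it obtains the prediction error variance at each order from the Levinson-Durbin recursion on an autocorrelation sequence (a Yule-Walker estimate standing in for the MLE), and it uses the exact autocorrelation of a hypothetical AR(2) process, x_t = 0.75 x_{t-1} - 0.5 x_{t-2} + e_t, so the result is deterministic.

```python
import math


def prediction_error_variances(r, max_order):
    """Levinson-Durbin recursion: error variance for orders 0..max_order.

    r[k] is the autocorrelation at lag k. The predictor polynomial follows
    the convention x_t + a_1 x_{t-1} + ... + a_m x_{t-m} = e_t.
    """
    a = [1.0]
    err = r[0]
    variances = [err]
    for m in range(1, max_order + 1):
        acc = sum(a[i] * r[m - i] for i in range(m))
        k = -acc / err                      # reflection coefficient
        ext = a + [0.0]
        a = [ext[i] + k * ext[m - i] for i in range(m + 1)]
        err *= 1.0 - k * k                  # variance never increases
        variances.append(err)
    return variances


def aic_by_order(r, n_obs, max_order):
    """AIC(p) = T * log(variance_p) + 2p for p = 0..max_order."""
    variances = prediction_error_variances(r, max_order)
    return [n_obs * math.log(v) + 2 * p for p, v in enumerate(variances)]


# Normalized autocorrelation of the hypothetical AR(2) process, lags 0..4.
r = [1.0, 0.5, -0.125, -0.34375, -0.1953125]
variances = prediction_error_variances(r, 4)
aic = aic_by_order(r, 500, 4)
best_order = min(range(len(aic)), key=aic.__getitem__)
print(variances, best_order)
```

The error variance stops decreasing beyond order 2, so the 2p penalty makes higher orders strictly worse and the minimum falls at the true order.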
Appendix D. Vector AR Modeling of Strictly Stationary Speech

While other attempts to model spectral dynamics have not been overly successful for speech recognition [16, 66], the applicability to speaker recognition has not been examined. It is intuitive that the correlations and context of the changing phoneme vectors contain a source of untapped speaker dependent information. However, if certain standard assumptions are made of the speech signal within a phoneme, then the following two propositions explain the (negative) results of past researchers using conditional models.

Proposition D.1 Speech cepstral coefficients within a phone are a Strict Sense Stationary (SSS) vector process.

Speech is considered quasi-stationary, assumed stationary over 30-70 msec. This time corresponds to between 3 and 7 frames of data, using typical speech framing techniques. Strict sense stationarity of the speech samples, $x_t$, implies that frames of speech samples are also an SSS vector process, $\mathbf{X}_t$ - easily shown since different frames have the same n-th order density. The SSS characteristic of the $\mathbf{X}_t$ process is maintained even after being subjected to linear transforms, $\mathcal{L}$ (e.g. Fourier), and memoryless system transformations (like square-law, mel and log). Thus, the cepstral vector process, $c_t = \mathcal{L}[\log(\mathcal{L}[\mathbf{X}_t]^2)]$, is strict sense stationary.

If a vector autoregressive model (for each hidden state) is used, then the following proposition results. Let the state dynamics of the cepstral vectors be defined by

\[
c_t = \mu - \sum_{i=1}^{p} A_i\, c_{t-i} + \epsilon_t
\]

with the noise ($\epsilon_t$) being a white, normal vector process.




Proposition D.2 During periods of stationarity (within a phone), the reestimation of a P-th order vector autoregressive model, given observations of cepstral coefficients, results in non-unique solutions. Two possible trivial solutions are:

1. (Trivial filter) $\mu = 0$ and $A_1 = -I$, $A_i = 0$, $i = 2, \ldots, P$

2. (HMM) $\mu = \mu^*$ and $A_i = 0$, $i = 1, \ldots, P$

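For a given state (state index suppressed) and based on the assumed model, the state conditional density can be specified,

\[
p(c_t \mid c_{t-1}, \ldots, c_{t-p}) = (2\pi)^{-d/2}\, |\Sigma|^{-1/2} \exp\Big\{ -\tfrac{1}{2} \Big( c_t - \mu + \sum_{i=1}^{p} A_i c_{t-i} \Big)' \Sigma^{-1} \Big( c_t - \mu + \sum_{i=1}^{p} A_i c_{t-i} \Big) \Big\}
\]

and then,

\[
E[c_t \mid c_{t-1}, c_{t-2}, \ldots, c_{t-p}] = \mu - \sum_{i=1}^{p} A_i\, c_{t-i}
\]

The stationary characteristic of the cepstral vector process implies the unconditional expectation, $E[c_t]$, is constant, denoted by $\mu^*$. Now, given a sequence of a particular stationary phoneme process, the expectation of this conditional mean over all past observations is

\[
\begin{aligned}
E\big[ E[c_t \mid c_{t-1}, c_{t-2}, \ldots, c_{t-p}] \big] &= E[\mu] - \sum_{i=1}^{p} A_i\, E[c_{t-i}] \\
 &= \mu - \sum_{i=1}^{p} A_i\, \mu^* = \mu^*. \qquad (63)
\end{aligned}
\]

Hence,

\[
\mu = \mu^* + \sum_{i=1}^{p} A_i\, \mu^*
\]

and when $A_i = 0$, then $\mu = \mu^*$, which is the standard Gaussian hidden Markov model. Likewise, when $A_1 = -I$ and $A_i = 0$ for $i \ge 2$, then $\mu = 0$. The reestimation will attempt to model this behavior for each state. In doing so, the two trivial solutions in the proposition are easily seen to be true based on Equation 63.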
Appendix E. Syntactic Explanation for Forced Viterbi

The ability to model a sequence of symbols has been shown to reduce entropy [42] and guarantee a reduction in probability of error [73]. Statistical and syntactic pattern recognition provides a foundation for classifying targets which have an inherent stochastic grammar, $\mathcal{G}$. This grammar induces a set of possible observation sequences called a stochastic language, $\mathcal{L}$ [125]. When given a set of hidden Markov models, a hierarchy of constraints can be placed on the Viterbi decoding process, in effect changing the grammar. The grammar, in turn, changes the size of the language. For speaker recognition, best results occur when a forced Viterbi decoding is used over alternative methods such as word grammar, word-pair grammar or simple phoneme decoding. This section provides a mathematical explanation.

For example, consider the following four grammars, each a language level constraint on the Viterbi decoding process. The first two are phoneme based grammars.

• $\mathcal{G}_{FV}$, Constrain all phonemes to a transcription (Forced Viterbi)

• No constraints on phonemes (NoGrammar)

The next two are based on word models. First, a dictionary is created such that words are defined by a fixed sequence of phonemes, with optional silence.

• $\mathcal{G}_{WP}$, Constrained phonemes within words and constrained word pairs (WordPair)

• $\mathcal{G}_{WG}$, Constrained phonemes within words (WordGrammar)

For the statistical approach with several monophone models, let $\lambda$ represent the
overall speaker model. Since recognition scores change by several orders of magnitude based
on the word sequence alone, this variable is shown explicitly in the Viterbi score. Denote
the word sequence by $W$. Viterbi decoding provides the joint likelihood score of the
observations and the maximum likelihood word sequence:

\[
p(O_1 \ldots O_t \mid \lambda, \mathcal{L}, W) = \max_{W} p(O_1 \ldots O_t, W \mid \lambda, \mathcal{L}) = \max_{W} p(O_1 \ldots O_t \mid W, \lambda, \mathcal{L})\, p(W \mid \lambda, \mathcal{L}) \tag{64}
\]
In order to compare speakers using Viterbi decoding, the second term in Equation 64 must
be the same. When using forced Viterbi decoding, the set of models is fixed and the size of
the language is $|\mathcal{L}| = 1$. However, any other grammatical approach will incur a different
multiplicative expression based on the complexity of the grammar. Any other method,
such as phoneme decoding, word grammar, or word-pair grammar, which subsequently
induces a larger language $\mathcal{L}$, will be comparing two speakers on potentially different word
and phoneme sequences.
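
The point about comparing speakers on different word sequences can be illustrated with a toy calculation. This is only a sketch and not the system described in this work: the candidate word sequences, the log-likelihood values, and the helper names `free_decode` and `forced_decode` are all hypothetical, chosen to show how an unconstrained maximization over $W$ can score two speaker models on different sequences, while forced decoding scores both on the same transcription.

```python
# Hypothetical joint log-likelihoods log p(O, W | speaker model) for one
# utterance, scored against two speaker models over three candidate word
# sequences. All numbers are invented for illustration only.
SCORES = {
    "claimant": {"26-81-57": -100.0, "26-81-67": -112.0, "36-81-57": -118.0},
    "impostor": {"26-81-57": -130.0, "26-81-67": -109.0, "36-81-57": -125.0},
}


def free_decode(scores):
    """Return (best word sequence, score): the max over W of log p(O, W | model)."""
    best = max(scores, key=scores.get)
    return best, scores[best]


def forced_decode(scores, transcription):
    """Return the score of the known transcription only (language of size 1)."""
    return scores[transcription]


truth = "26-81-57"  # the transcription assumed known for forced decoding
free = {spk: free_decode(s) for spk, s in SCORES.items()}
forced = {spk: forced_decode(s, truth) for spk, s in SCORES.items()}

# Separation between claimant and impostor scores under each scheme.
free_gap = free["claimant"][1] - free["impostor"][1]
forced_gap = forced["claimant"] - forced["impostor"]

print("free decoding:", free, "gap:", free_gap)
print("forced decoding:", forced, "gap:", forced_gap)
```

With these made-up numbers the impostor model finds a better path through a different word sequence under free decoding, which shrinks the score gap relative to forced decoding.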

We demonstrate that as the size of the language increases by choice of grammar, the entropy
(bits/phoneme, Equation 65) increases, and this results in an increased equal error rate, shown
in Table 19 and Figure 27. Figure 27 further demonstrates the relationship by plotting
entropy (dashed) against EER for male and female speakers separately. For this demonstration,
cohorts were not used, specifically to examine the overlap between true claimant
and impostor scores without any normalization. Recall that the equal error rate occurs when the
false acceptance error rate (impostor errors) equals the false rejection error rate (true
claimant errors). Using Levinson’s definition of entropy [72],

\[
H(\mathcal{L}) = \frac{\log_2 |\mathcal{L}|}{E[n]} \tag{65}
\]

which uses the size of the language $|\mathcal{L}|$ and the average number of words per utterance,
converted to bits per phoneme. Table 19 shows the size of the language with entropy for
the various grammars. In the table, $m$ is the number of phoneme/word choices at each
time, $E[n]$ is the expected number of phonemes/words during an utterance, and $|\mathcal{L}|$
denotes how many possible paths exist through the grammar. All quantities have been
converted to phoneme units for calculation of entropy.
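
As a rough check on the quantities in Table 19, the sketch below recomputes the language size and entropy for the WordPair and Word rows. It rests on two assumptions that are not stated explicitly in the text: that the language size is $m^{E[n]}$, and that the entropy is $\log_2$ of the language size divided by an utterance length of 29 phonemes (the $E[n]$ listed for NoGrammar). The function names are hypothetical. The NoGrammar row is not checked here.

```python
import math


def language_size(m, n):
    """Number of paths when each of n positions allows m choices (assumed m**n)."""
    return m ** n


def entropy_bits_per_phoneme(size, phonemes_per_utterance):
    """Assumed reading of the entropy: log2 of the language size per phoneme."""
    return math.log2(size) / phonemes_per_utterance


# Assumption: utterance length in phonemes, taken from the NoGrammar row.
PHONEMES_PER_UTTERANCE = 29

# (m, E[n]) pairs as listed in Table 19 for the two word-based grammars.
rows = {"WordPair": (57, 3), "Word": (16, 6)}

results = {}
for name, (m, n) in rows.items():
    size = language_size(m, n)
    entropy = entropy_bits_per_phoneme(size, PHONEMES_PER_UTTERANCE)
    results[name] = (size, entropy)
    print(f"{name}: size = {size}, entropy = {entropy:.2f} bits/phoneme")
```

Under these assumptions the two rows agree with the tabulated language sizes and entropies to the precision shown in the table.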

In summary, by changing the grammar or syntax allowed by Viterbi decoding, different size
stochastic languages are created. For automatic speech recognition, these language constraints
ensure the recognition fits semantically acceptable speech. However, these larger
languages also allow impostors to find better paths through the language, which may not
fit the semantics of the transcriptions. By using likelihoods of observations, we must ensure
that the conditioning of word and phoneme sequences is identical for all Viterbi scores used
in recognition.


Table 19.  YOHO Language Constraints, where the language allows an average of $E[n]$
symbols from a set of $m$. $|\mathcal{L}|$ denotes how many possible paths exist through
the grammar. All entries for entropy were converted to bits per phoneme using
an average of 2.9 phonemes per word.

Grammar $\mathcal{G}$    m     E[n]    Language $|\mathcal{L}|$    H(G)
Transcription            1     1       1                           0.0
WordPair                 57    3       1.85e+5                     0.60
Word                     16    6       1.67e+7                     0.83
NoGrammar                21    29      5.5e+38                     3.5


Figure 27.  Equal error rates (Dashed) for males (left) and females (right) over the YOHO
database with one combination per test trial. Also shown is Entropy (Solid)
in bits/phoneme of the language induced by the grammar.
Appendix F.  Language Hypothesis


In this appendix, a hypothesis concerning ergodic hidden Markov model use for
speaker recognition is proposed and demonstrated experimentally. Experimental results
of Poritz [92, 113, 115, 124] and added interpretations by Levinson [73] also support this
hypothesis. In detailing methods of speech recognition, Levinson interprets the experiments
of Poritz as representing the structure found in the symbols of the language. He
further substantiates this interpretation by reference to 1) English text modeling using an
ergodic HMM framework by Cave and Neuwirth, and 2) the original use by Markov himself
for analyzing printed Russian text.

Proposition F.1  An ergodic hidden Markov model $\lambda$ trained with unlabeled speech to
model a speaker will represent language model statistics in the Markov state transition
matrix.

Unless  a  predefined  transition  structure  is  provided,  the  state  densities  will  model  speaker 
dependent  spectra,  but  the  transitions  between  these  spectra  will  be  language  dependent  as 
evidenced  by  Levinson  and  references  within.  To  demonstrate  this  experimentally,  compare 
the  steady  state  probabilities  of  a  trained  ergodic  hidden  Markov  model  to  the  statistics 
of  the  broad  class  transcriptions. 

Beginning with phoneme labels (Table 20), transform the automatic phoneme segmentation
to broad class labels and estimate the bi-class probabilities.


Table 20.  Broad Class/Phoneme Relation

Broad Class         Phoneme
Vowel (V)           IY IH EH AX AH UX UH AO EY AY
Liquid/Glide (L)    R W ER
Nasal (N)           N
Consonant (C)       (DX) T K V F TH S
Silence (S)         sp sil


Using unlabeled data, the Baum-Welch algorithm reestimated the parameters of two ergodic,
five-state hidden Markov models. These systems included an HMM based on Mel frequency
cepstral features (with regression) and a third order hidden filter Markov model (similar to
Poritz). The resulting transition matrices were extracted and the stationary probabilities,
$p(q_t = i)$, were analyzed (Table 21).
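
The text does not say how the stationary probabilities were obtained from the transition matrices. One common way, sketched here as an assumption rather than as the method actually used, is to iterate $\pi \leftarrow \pi A$ until it settles, since the steady-state distribution of an ergodic chain satisfies $\pi = \pi A$. The function name and the two-state example matrix are hypothetical.

```python
def stationary_distribution(transition, iterations=1000):
    """Approximate pi with pi = pi * A by repeated multiplication.

    `transition` is a row-stochastic matrix given as a list of rows,
    where transition[i][j] is the probability of moving from state i to j.
    """
    n = len(transition)
    pi = [1.0 / n] * n  # start from the uniform distribution
    for _ in range(iterations):
        pi = [sum(pi[i] * transition[i][j] for i in range(n)) for j in range(n)]
    return pi


# Hypothetical two-state ergodic chain, used only to exercise the function.
A = [
    [0.9, 0.1],
    [0.5, 0.5],
]
pi = stationary_distribution(A)
print(pi)
```

For a five-state ergodic model the same function would take the 5 x 5 transition matrix estimated by Baum-Welch and return five steady-state probabilities.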


Table 21.  Steady State Language Statistics.

Vowel    Liquid-Glide    Nasal    Consonant    Silence
.19      .05             .10      .19          .47


The steady state probabilities from the five-state ergodic HMMs, using both the Poritz
method and Mel frequency cepstral features, can then be compared in Table 22.


Table 22.  Learning the Language with Ergodic Models

Ergodic Poritz Method (5 state)    .19  .01  .18  .26  .36
Ergodic Gaussian HMM (5 state)     .19  .01  .14  .25  .41
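
The comparison between Tables 21 and 22 is made qualitatively in the text. As an illustrative addition, the sketch below puts a number on it using the total variation distance, which is a choice made here and not a measure used in this work. It assumes that the five values in each row of Table 22 are ordered to match the classes of Table 21 (Vowel, Liquid-Glide, Nasal, Consonant, Silence), and that the first row of values belongs to the Poritz model and the second to the Gaussian HMM.

```python
# Broad class statistics from the transcriptions (Table 21).
transcription = [0.19, 0.05, 0.10, 0.19, 0.47]

# Steady-state probabilities of the trained ergodic models (Table 22).
# Assumption: states are listed in the same class order as Table 21.
models = {
    "Poritz": [0.19, 0.01, 0.18, 0.26, 0.36],
    "Gaussian": [0.19, 0.01, 0.14, 0.25, 0.41],
}


def total_variation(p, q):
    """Half the sum of absolute differences between two distributions."""
    return 0.5 * sum(abs(a - b) for a, b in zip(p, q))


distances = {
    name: total_variation(transcription, probs) for name, probs in models.items()
}
for name, dist in distances.items():
    print(f"{name}: total variation distance = {dist:.2f}")
```

A distance of 0 would mean identical distributions and 1 would mean no overlap at all.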

While the automatic phoneme transcriptions will not be precise, the similarity of
the broad language statistics of the data to the learned model stationary state statistics
is remarkable. Based on Poritz's initial experiments, Levinson’s clarification of these results
with historical ergodic interpretations, and these YOHO results, the explanation of past
“failures” in useful transition modeling is complete. Recall the reference to Nolan [85]
in Chapter I, who surveys several researchers claiming that, in addition to the vocal
anatomy, voice differences are the result of neural patterns and habits. These manifest
themselves in the acoustic signal through coarticulation effects and formant dynamics.
Effective speaker recognition strategies should capitalize on these dynamics.

